Preface



This volume contains all the papers presented at the International Conference on 
Algorithmic Learning Theory 1999 (ALT’99), held at Waseda University Inter- 
national Conference Center, Tokyo, Japan, December 6—8, 1999. The conference 
was sponsored by the Japanese Society for Artificial Intelligence (JSAI). 

In response to the call for papers, 51 papers on all aspects of algorithmic 
learning theory and related areas were submitted, of which 26 papers were se- 
lected for presentation by the program committee based on their originality, 
quality, and relevance to the theory of machine learning. In addition to these 
regular papers, this volume contains three papers of invited lectures presented 
by Katharina Morik of the University of Dortmund, Robert E. Schapire of AT&T 
Labs, Shannon Lab., and Kenji Yamanishi of NEC, C&C Media Research Lab. 

ALT’99 is not just another conference in the ALT series: it
marks the tenth anniversary of the series, which was launched in Tokyo in
October 1990 for the discussion of research topics on all areas related to algorithmic
learning theory. The ALT series was renamed last year from “ALT workshop” to 
“ALT conference”, expressing its wider goal of providing an ideal forum to bring
together researchers from both theoretical and practical learning communities, 
producing novel concepts and criteria that would benefit both. This movement 
was reflected in the papers presented at ALT’99, where there were several papers 
motivated by application-oriented problems such as noise and data precision.
Furthermore, ALT’99 benefited from being held jointly with the 2nd Interna- 
tional Conference on Discovery Science (DS’99), the conference for discussing, 
among other things, more applied aspects of machine learning. Also, we could 
celebrate the tenth anniversary of the ALT series with researchers from both 
theoretical and practical communities. 

This year we started the E Mark Gold Award for the most outstanding 
paper by a student author, selected by the program committee of the conference. 
This year’s award was given to Yuri Kalnishkan for his paper “General Linear 
Relations among Different Types of Predictive Complexity”.

We wish to thank all who made this conference possible, first of all, the 
authors for submitting papers and the three invited speakers for their excellent 
presentations and their contributions of papers to this volume. 

We are indebted to all members of the program committee: Nader Bshouty 
(Technion, Israel), Satoshi Kobayashi (Tokyo Denki Univ., Japan), Gabor
Lugosi (Pompeu Fabra Univ., Spain), Masayuki Numao (Tokyo Inst. of Tech.,
Japan), Robert Schapire (AT&T Shannon Lab., USA), Arun Sharma (New South
Wales, Australia), John Shawe-Taylor (Univ. of London, UK), Ayumi Shino- 
hara (Kyushu Univ., Japan), Prasad Tadepalli (Oregon State Univ., USA), Jun- 
ichi Takeuchi (NEC C&C Media Research Lab., Japan), Akihiro Yamamoto 
(Hokkaido Univ., Japan), Rolf Wiehagen (Univ. Kaiserslautern, Germany), and 
Thomas Zeugmann (Kyushu Univ., Japan). They and the subreferees (listed
separately) put a huge amount of work into reviewing the submissions and judging
their importance and significance. 

We also gratefully acknowledge the work of all those who did important 
jobs behind the scenes to make this volume as well as the conference possible. 
We thank Akira Maruoka for providing valuable suggestions, Shigeki Goto for 
the initial arrangement of a conference place, Naoki Abe for arranging an in- 
vited speaker, Shinichi Shimozono for producing the ALT99 logo, Isao Saito for 
drawing the ALT99 posters, and Springer-Verlag for their excellent support in
preparing this volume. 

Last but not least, we are very grateful to all the members of the local ar- 
rangement committee: Taisuke Sato (chair), Satoru Miyano, Ayumi Shinohara, 
without whose efforts this conference would not have been successful. 



Tokyo, August 1999 



Osamu Watanabe 
Takashi Yokomori 
Abstract. Designing the representation languages for the input and
output of a learning algorithm is the hardest task within machine learn- 
ing applications. Transforming the given representation of observations 
into a well-suited language Le may ease learning such that a simple and 
efficient learning algorithm can solve the learning problem. Learnability 
is defined with respect to the representation of the output of learning, 
Lh. If the predictive accuracy is the only criterion for the success of
learning, the choice of Lh means to find the hypothesis space with most 
easily learnable concepts, which contains the solution. Additional criteria 
for the success of learning such as comprehensibility and embeddedness 
may ask for transformations of Lh such that users can easily interpret 
and other systems can easily exploit the learning results. Designing a lan- 
guage Lh that is optimal with respect to all the criteria is too difficult 
a task. Instead, we design families of representations, where each family 
member is well suited for a particular set of requirements, and implement 
transformations between the representations. In this paper, we discuss a 
representation family of Horn logic. Work on tailoring representations is 
illustrated by a robot application. 



1 Introduction 

Machine learning has focused on the particular learning task of concept learning. 
Investigating this task has led to sound theoretical results as well as to efficient 
learning systems. However, some aspects of learning have not yet received the 
attention they deserve. Theoretical results are missing where they are urgently 
needed for successful applications of machine learning. If we contrast the theo- 
retically analyzed setting of concept learning with the one found in real-world 
applications, we encounter quite a number of open research questions. Let me
draw your attention to those questions that are related to tailoring
representations.

The setting of concept learning that has been well investigated can be sum- 
marized as follows. 

Concept learning: Given a set of examples in a representation language Le, 
drawn according to some distribution, and background knowledge in a rep- 
resentation language Lb, 
learn a hypothesis in a representation language Lh that classifies further
examples properly according to a given criterion. 

Different paradigms vary assumptions about the distribution and about the ex- 
istence of background knowledge. In the paradigm of probably approximately 
correct (PAC) learning, for instance, an unknown but fixed distribution is as- 
sumed and background knowledge is ignored. In the paradigm of inductive logic 
programming (ILP), the distribution is ignored, but background knowledge is 
taken into account. Different approaches vary the particular success criterion. 
For instance, information gain, minimal description length, or Bayes-oriented 
criteria are discussed. Structural risk minimization balances the error rate and 
the complexity of hypotheses. The variety of criteria can be subsumed under 
that of accuracy of predictions (i.e. correctness of classifications of new observa- 
tions). In contrast, applications of machine learning are evaluated with respect 
to additional criteria: 

comprehensibility: How easily can the learning results be interpreted by hu- 
man decision makers? 

embeddedness: Most learning results are produced in order to enhance the ca- 
pabilities of another system (e.g., problem solving system, natural language 
system, robot), the so-called performance system. The criterion is here: How 
easily can learning results be used by the performance system? 

These criteria are most often ignored in theoretical studies. Comprehensibility 
and embeddedness further constrain the choice of an appropriate hypothesis 
language Lh. The problem is that they frequently do so in a contradictory
manner. A representation that is well suited for a human user may be hard
for a performance system to handle. A representation that is well suited for a
system to perform a certain task may be hard for a human user to understand.
To make things even worse, we must also consider the input of the learning
algorithm. A learning algorithm Ai can be characterized by its input (i.e. Lei,
Lbi) and output (i.e. Lhi) formats. Those pairs of input and output languages
for which an efficient and effective learning algorithm exists are called admissible
languages. In this way, Lh is further constrained by the given data in Le. It is
hard or even impossible to design a representation Lh which is 

— efficiently learnable 

— from data in a given format, 

— easily understood by users, 

— and can be put to direct use by a performance system. 

Instead of searching for one representation that fulfills at the same time the 
requirements of the learning algorithm, the given data, the users, and the per- 
formance system, we may want to follow a more stepwise procedure. We divide 
the overall representation problem into several subproblems. Each subproblem 
consists of two parts, namely determining an appropriate representation and
developing a transformation from a given representation into it. In order not to
trade in a simple representation language for a complex transformation process,
we design families of representations, where each family member is well suited
for a particular set of requirements, and efficient transformations between the 
family members are possible. The subproblems are: 

Learnability: The majority of theoretical analyses show the learnability of
concept classes under certain restrictions of the learning task, the represen- 
tation languages, and the criterion for accuracy. Hence, this subject does not 
require justification. However, the transformation from a given representa- 
tion into the desired one raises questions that have not yet been addressed 
frequently. The transformation here converts Lhi into Lhj, where Lhj is
better suited for learning. 

Optimizing input data: The hardest task of machine learning applications 
is the design of a well-suited representation Lej. On one hand, the given
representation formalism Lei must be converted into the one accepted by the 
learning algorithm as input. On the other hand, the representation language 
(signature) can be optimized. The formalism remaining the same, the set 
of features (predicates) is changed. The standard procedure is to run the 
algorithm on diverse feature sets until one is found that leads to an acceptable 
accuracy of learning. Most feature construction or selection methods aim at 
overcoming this trial-and-error approach. 

Comprehensibility: Although stated as a primary goal of machine learning 
from the very beginning, the criterion of comprehensibility never became
a hot topic of learning theory. 

Embeddedness or program optimization: Learning results are supposed to 
enhance a procedure. Most often, the procedure is already implemented as 
a particular performance system. The learning result in Lhi must be
transformed into the representation Lhj of the performance system. Quite often,
the performance system offers restrictions that are not presupposed by the 
learning algorithm. In this case, the transformation may be turned into an 
optimization, exploiting the application’s restrictions. 

In this paper, a family of subsets of Horn logic is discussed and illustrated 
by a robot application. 



2 Learnability of Restricted Horn Logic 

The learnability subject does not need any justification here. However, the
transformation of one representation Lhi into another one Lhj which is better suited
for learning, does. An important question is, how to recognize those parts in a 
given representation (or concept class) automatically, that cause negative learn- 
ability results. The user may then be asked, whether the learning task could be 
weakened, or at least be warned that it may take exponential time to compute 
a result. For instance, variables that occur in the head of a clause but
not in its body prevent the clause from being generative. This can easily be checked
automatically. Even suggestions for fixing the clause can be computed on the 
basis of the predicates defined so far. Alternatively, the learning algorithm
may start with a simple representation and increase the complexity of the
hypothesis space. Of course, this alternative is easier than the recognition of the
complexity class. 
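
The automatic check for non-generative clauses mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: literals are encoded as (predicate, arguments) pairs, variables follow the Prolog convention of starting with an uppercase letter, and the function names are hypothetical, not taken from any system discussed here.

```python
def is_var(term):
    """Assumed convention: variables are strings starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def variables(literals):
    """Collect all variables occurring in a list of (predicate, args) literals."""
    return {t for _, args in literals for t in args if is_var(t)}


def ungenerative_vars(head, body):
    """Return the head variables that never occur in the body.

    An empty result means the clause is generative.
    """
    return variables([head]) - variables(body)


# Hypothetical clause: Z occurs in the head only, so the clause is not generative.
head = ("alongWall", ("S", "X", "Z"))
body = [("stand", ("V", "O", "X", "Y1")), ("decrPeak", ("S", "O", "Y1", "W"))]
missing = ungenerative_vars(head, body)
print(missing)
```

The returned set could also serve as the starting point for suggesting a fix, for example by looking for known predicates that bind the missing variables.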

Desired are learnability proofs that can be made operational in the sense
that indicators for (non-)learnability are defined that can be recognized
automatically. One example of an operational complexity criterion is the one used
by the support vector machine. The width of the margin separating positive
and negative examples corresponds to the VC dimension of the hypothesis space,
but can be calculated much more easily¹. According to this operational criterion,
hypothesis spaces of increasing complexity are tried by the learning algorithm. A
rough indication of the complexity of hypotheses from the ILP paradigm is the
number of literals in a clause. According to this criterion, the RDT algorithm
searches for increasingly complex hypotheses, and a similar procedure is
followed in related work. However, the number of literals is merely a heuristic measure of
complexity.

Let me now illustrate operational proofs by an example. The difficulty of first-order
logic induction originates in first-order logic deduction. Hypothesis testing,
saturation of examples with background knowledge, the comparison of competing
hypotheses, and the reduction of hypotheses under equivalence all use
deductive inference. Since θ-subsumption is a correct inference, and a complete one for
clauses that are neither self-resolving nor tautological, most ILP systems use it.

θ-subsumption: A clause D θ-subsumes a clause C iff there exists a
substitution θ such that $D\theta \subseteq C$.
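
To make the definition concrete, here is a naive backtracking test for θ-subsumption. It is a sketch under simplifying assumptions, not the fast algorithm discussed later: clauses are lists of function-free (Datalog) literals, literal signs are ignored, uppercase strings in D are variables, and every term of C is treated as a constant. The names `match` and `subsumes` are hypothetical.

```python
def is_var(term):
    """Assumed convention: variables are strings starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def match(lit_d, lit_c, theta):
    """Extend theta so that lit_d under theta equals lit_c; return None on failure."""
    (pred_d, args_d), (pred_c, args_c) = lit_d, lit_c
    if pred_d != pred_c or len(args_d) != len(args_c):
        return None
    theta = dict(theta)  # copy, so that backtracking stays side-effect free
    for s, t in zip(args_d, args_c):
        if is_var(s):
            if theta.setdefault(s, t) != t:
                return None  # s is already bound to a different term
        elif s != t:
            return None
    return theta


def subsumes(d, c, theta=None):
    """True iff some substitution maps every literal of d onto a literal of c."""
    theta = {} if theta is None else theta
    if not d:
        return True
    first, rest = d[0], d[1:]
    # The choice of the target literal is the source of indeterminism.
    return any(
        ext is not None and subsumes(rest, c, ext)
        for ext in (match(first, lit, theta) for lit in c)
    )


d = [("p", ("X", "Y")), ("q", ("Y",))]
c = [("p", ("a", "b")), ("p", ("a", "c")), ("q", ("c",))]
print(subsumes(d, c))                      # needs backtracking over p(a,b)
print(subsumes([("p", ("X", "X"))], c))    # no consistent binding for X
```

The search tries every literal of C for every literal of D, which is exactly the indeterminism that makes the general problem hard.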

θ-subsumption is NP-complete because of the indeterminism of choosing θ
in the general case. Hence, the clauses need to be restricted. Restricting Horn
clauses to k-llocal clauses allows for polynomial learning. William Cohen’s
proof maps k-llocal clauses on monomials for which learnability proofs exist.
However, the proof is not operational. It just shows where it is worth looking
for theoretically well-based algorithms. What we need is an operational proof
of polynomial complexity of θ-subsumption of k-llocal clauses in order to solve
the deductive problem within induction. We then need a learning algorithm
exploiting the restriction so that learning k-llocal clauses is polynomial. Also
desired is an automatic check whether a clause is k-llocal or not.

k-llocal Horn clause: Let $D = D_0 \leftarrow D_{DET}, D_{NONDET}$ be a Horn clause,
where $D_{DET}$ is the deterministic and $D_{NONDET}$ is the indeterministic part
of its body. Let vars be a function computing the set of variables of a clause.
$LOC_i \subseteq D_{NONDET}$ is a local part of D, iff

$(vars(LOC_i) \setminus vars(\{D_0, D_{DET}\})) \cap vars(D_{NONDET} \setminus LOC_i) = \emptyset$ and
there does not exist $LOC_j \subset LOC_i$ which is also a local part of D.

A local part $LOC_i$ is k-llocal, iff

there exists a constant k such that $k \geq |LOC_i|$.

A Horn clause is k-llocal, iff each local part of it is k-llocal.

1 Determining the VC dimension is a problem in [...], where n is the size of a
matrix representing the concept class.
Jörg-Uwe Kietz presents a fast subsumption algorithm for
$D = D_0 \leftarrow D_{DET}, LOC_1, \ldots, LOC_n$ and any Horn clause C. For the deterministic part of
D, there exists exactly one substitution θ. An efficient subsumption algorithm
for deterministic Horn clauses has been presented in earlier work. In the indeterministic part
of D, the algorithm looks for local parts, i.e. for sets of literals which do not share an
indeterministic variable. This is done by merging sets of literals which have a
variable in common. In this way, the algorithm checks whether a given clause D
is k-local or not. This can be done polynomially. If the local part is bounded
by some k, its subsumption is in $O(n \cdot (k^k \cdot |C|))$. If, however, $D_{Body}$ consists
of one large local part, the subsumption is exponential. The proof of polynomial
complexity in the number of literals of the clauses D and C corresponds
directly to the fast subsumption algorithm. The algorithm guarantees efficient
subsumption for k-local D, but can be applied to any Horn clause.
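
The merging step that finds the local parts can be sketched as follows. This is a minimal illustration, not the published algorithm: it assumes the split into deterministic and indeterministic literals is already given, that `bound_vars` holds the variables of the head and the deterministic part, and that variables are uppercase strings. The function names are hypothetical.

```python
def is_var(term):
    """Assumed convention: variables are strings starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def local_parts(nondet_body, bound_vars):
    """Group indeterministic literals that are connected via unbound variables."""
    parts = []  # pairs (unbound variables of the part, literals of the part)
    for lit in nondet_body:
        free = {t for t in lit[1] if is_var(t)} - bound_vars
        lits = [lit]
        untouched = []
        for part_vars, part_lits in parts:
            if part_vars & free:          # shares an indeterministic variable
                free = free | part_vars   # merge the two sets of literals
                lits = part_lits + lits
            else:
                untouched.append((part_vars, part_lits))
        parts = untouched + [(free, lits)]
    return [part_lits for _, part_lits in parts]


def is_k_llocal(nondet_body, bound_vars, k):
    """True iff every local part contains at most k literals."""
    return all(len(part) <= k for part in local_parts(nondet_body, bound_vars))


# Hypothetical clause body: X is bound by the head, A, B, C are indeterministic.
body = [("r", ("X", "A")), ("s", ("A", "B")), ("t", ("X", "C")), ("u", ("C",))]
parts = local_parts(body, bound_vars={"X"})
print(parts)
print(is_k_llocal(body, {"X"}, 2), is_k_llocal(body, {"X"}, 1))
```

Because existing parts always have pairwise disjoint variable sets, one pass over the parts per literal is enough, so the partitioning itself stays polynomial in the number of literals.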

The least general generalization (LGG) of Gordon Plotkin is then adjusted
to k-llocal clauses, and it is shown that the size of the LGG of k-llocal
clauses does not increase exponentially (as does the LGG of unrestricted Horn
clauses) and that reduction of k-llocal clauses is polynomial, too. Together with
known negative learnability results, k-llocal clauses can be considered the
borderline between learnable and not learnable indeterministic clauses. This concludes
the illustration of what I mean by operational proofs.
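
Why the unrestricted LGG grows so quickly can be seen from a small sketch of Plotkin's construction for function-free clauses. The encoding (lists of (predicate, arguments) literals, generated variable names `V0`, `V1`, ...) and the function names are assumptions made for illustration; reduction of the result is not included.

```python
def lgg_literals(a, b, table):
    """Generalize two literals; differing argument pairs map to shared variables."""
    (pred_a, args_a), (pred_b, args_b) = a, b
    if pred_a != pred_b or len(args_a) != len(args_b):
        return None  # incompatible literals have no common generalization
    args = []
    for s, t in zip(args_a, args_b):
        if s == t:
            args.append(s)
        else:
            if (s, t) not in table:
                table[(s, t)] = f"V{len(table)}"  # same pair, same variable
            args.append(table[(s, t)])
    return (pred_a, tuple(args))


def lgg_clauses(c1, c2):
    """Unreduced LGG: generalize every compatible pair of literals."""
    table, result = {}, []
    for a in c1:
        for b in c2:
            g = lgg_literals(a, b, table)
            if g is not None and g not in result:
                result.append(g)
    return result


c1 = [("p", ("a", "b")), ("p", ("b", "c"))]
c2 = [("p", ("d", "e")), ("p", ("e", "f"))]
lgg = lgg_clauses(c1, c2)
print(len(lgg), lgg)
```

Two clauses with two literals each already yield four literals, so the size can multiply with every further example unless the clauses are restricted or the result is reduced.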



3 Optimizing Input Data 

The transformation of the given data within the same representation formalism 
is called feature construction/extraction, if new features are built on the basis 
of the given ones, and feature selection, if the most relevant subset of features 
is selected. An explicit construction is the invention and definition of new
predicates. Including the transformation into the
learning algorithm is called constructive induction. An implicit adding
of new features is performed by the support vector machine. Using a kernel
function, the feature space is transformed such that the observations become
linearly separable. A more complex input representation allows the algorithm
to remain simple. Transformation plus learning algorithm remain efficient only
because the enriched feature space need not be used for computation: the kernel
functions are used instead. It is an open question how to make explicit
which of the added features contributed most to a good learning result.
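
The point that the enriched feature space is never computed explicitly can be checked on a small example. The sketch below uses the homogeneous polynomial kernel of degree 2 on two-dimensional inputs, chosen here only for illustration; the paper does not commit to a particular kernel.

```python
import math


def dot(a, b):
    """Standard inner product of two equally long vectors."""
    return sum(p * q for p, q in zip(a, b))


def phi(x):
    """Explicit feature map for the degree-2 kernel on 2-dimensional inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def kernel(x, y):
    """Same inner product, computed without building the enriched features."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(y))
implicit = kernel(x, y)
print(explicit, implicit)
```

Both values agree up to rounding, but the kernel needs only the original two coordinates, which is why the richer representation does not make learning more expensive.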

Let me now turn to the transformation from one representation formalism 
to another one. A learning algorithm is selected because of both its input and
output formalisms (plus other properties). It may turn out, however, that there 
does not exist an algorithm with the desired input-output pair of languages. In 
this case, one of the options is to transform the data from the formalism in which 
they are available into the input format of the selected algorithm. The current 
interest in feature construction may stem from knowledge discovery in databases 
(KDD). The given database representation has to be transformed into one
which is accepted by the learning algorithm. Of course, for an ILP learning 
algorithm there exists a 1:1 mapping from a database table to a predicate.
However, this simple transformation most often is not one that eases learning. 
Since the arity of predicates cannot be changed through learning, a huge number 
of irrelevant database attributes is carried along. Therefore, the transformation 
from database tables into predicates should map parts of the tables to predicates, 
where the database key is one of the arguments of the predicate. A tool that 
gives an ILP learner direct access to a relational database, provides different 
types of mappings from tables to predicates, and constructs SQL queries for 
hypothesis testing automatically has been developed. However, the open
question remains whether there are theoretically well-based indicators of the 
optimal mapping for a particular learning task. This is an essential question, 
since the number of possible mappings is only bounded by the size of the universal 
database relation. Therefore, it does not help that we may well compute the size 
of the hypothesis space for each mapping. What is required is a structure in the 
space of mappings that allows for efficient search. 
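
One of the possible mappings, splitting a table into binary predicates that all carry the database key, can be sketched as follows. The table, the column names, and the function `table_to_facts` are hypothetical and only illustrate the idea; the tool mentioned above is not reproduced here.

```python
def table_to_facts(table_name, rows, key, columns=None):
    """Map each selected non-key column to a predicate table_column(Key, Value)."""
    facts = []
    for row in rows:
        for col, value in row.items():
            if col == key or (columns is not None and col not in columns):
                continue  # skip the key itself and attributes deemed irrelevant
            facts.append((f"{table_name}_{col}", (row[key], value)))
    return facts


# Hypothetical table with one relevant attribute.
rows = [
    {"id": 1, "name": "ann", "city": "dortmund"},
    {"id": 2, "name": "bob", "city": "tokyo"},
]
facts = table_to_facts("customer", rows, key="id", columns={"city"})
print(facts)
```

Compared with the 1:1 mapping, the learner now sees predicates of low arity and irrelevant attributes can be left out, at the price of having to choose among many possible column subsets.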

A frequent task when applying an ILP algorithm is to transform numerical 
measurements into qualitative Horn logic clauses (facts). A mobile robot, for in- 
stance, reports its action and perception by multivariate time series. Each sensor 
and the moving engine deliver a measurement per moment. The ILP learner re- 
quires a description of a path and what has been sensed in terms of time intervals 
during which some assertions are true. Hence, the signal-to-symbol transforma- 
tion needs to find adequate time intervals and assertions that summarize the 
measurements within the time interval. This task is different from time series 
analysis, where the curves of measurement are approximated. It corresponds 
to the analysis of event sequences, where events have to be automatically
recognized. As opposed to current approaches which learn about event sequences,
the robot application asks for a representation that can be produced
incrementally on-line. For our robot application we have implemented
a simple algorithm which constructs predicates of the form
increasing(MissionID, Angle, Sensor, FromTime, ToTime, RelToMove)
from measurements. During the time interval from FromTime to ToTime, the
sensor perceives increasing distance while being oriented at a particular angle
with respect to the global coordinates. The relation between measured distance
and moved distance is expressed by the last argument. The algorithm reads in 
a measurement and compares it with the current summarizing assertion (i.e., a 
predicate with all arguments bound except for ToTime). Either ToTime is bound, 
the event has ended and a new one starts, or the next measurement is read 
in. This procedure has some parameters (e.g., tolerated variance). These are 
adjusted on the data so that the transformation itself is adaptive. 
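
The incremental procedure can be sketched for a single sensor. This is a strongly simplified illustration, not the implemented algorithm: the trend names other than `increasing`, the `tolerance` parameter, and the function `segment` are assumptions, and the arguments MissionID, Angle, Sensor and RelToMove are left out.

```python
def segment(readings, tolerance=0.5):
    """Yield (trend, from_time, to_time) events from (time, distance) readings.

    Readings are consumed one at a time, so the events are produced on-line.
    """
    current = None   # [trend, from_time]; ToTime is still unbound
    previous = None  # last reading seen
    for time, value in readings:
        if previous is not None:
            delta = value - previous[1]
            if delta > tolerance:
                trend = "increasing"
            elif delta < -tolerance:
                trend = "decreasing"
            else:
                trend = "stable"
            if current is None:
                current = [trend, previous[0]]
            elif trend != current[0]:
                # Bind ToTime: the event has ended and a new one starts.
                yield (current[0], current[1], previous[0])
                current = [trend, previous[0]]
        previous = (time, value)
    if current is not None:
        yield (current[0], current[1], previous[0])


readings = [(0, 10.0), (1, 12.0), (2, 14.0), (3, 14.1), (4, 14.0), (5, 11.0)]
events = list(segment(readings))
print(events)
```

Note that the ToTime of each event is the FromTime of the next one, which is what later allows the facts to be chained.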

The language Le constructed has the nice property of fitting to general chain
rules. Given an appropriate substitution σ, a literal $B_i$ of a chain rule can be
unified with a ground fact f in Le, i.e. $B_i\sigma = f$.

General chain rule: Let S be a literal or a set of literals. Let args(S) be a
function that returns the Datalog arguments of S. A normal clause is a
general chain rule, iff its body literals can be arranged in a sequence

$B_0 \leftarrow B_1, B_2, \ldots, B_k, B_{k+1}$

such that there exist Datalog terms $X, Z \in args(B_0)$, $X, Y_1 \in args(B_1)$,
$Y_1, Y_2 \in args(B_2)$, ..., $Y_{k-1}, Y_k \in args(B_k)$, and $Y_k, Z \in args(B_{k+1})$.
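
The definition can be turned into a simple test for one given ordering of the body. The sketch below checks only what the definition literally states (shared arguments between the head and the first literal, between neighbouring literals, and between the last literal and the head); it does not search over rearrangements and does not verify that the shared arguments are time points. The function name `is_chain` is hypothetical.

```python
def is_chain(head, body):
    """Check the chaining condition for the body literals in the given order."""
    if not body:
        return False
    head_args = set(head[1])
    starts = head_args & set(body[0][1])   # candidates for X
    ends = head_args & set(body[-1][1])    # candidates for Z
    if not starts or not ends:
        return False
    # Every pair of neighbouring literals must share a linking term Y_i.
    return all(set(a[1]) & set(b[1]) for a, b in zip(body, body[1:]))


head = ("alongWall", ("S", "X", "Z"))
body = [
    ("stand", ("V", "O", "X", "Y1")),
    ("stable", ("V", "O", "Y1", "Y2")),
    ("decrPeak", ("S", "O", "Y2", "Z")),
]
print(is_chain(head, body))
print(is_chain(head, [body[2], body[0], body[1]]))  # this ordering does not chain
```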

Obviously, chain rules are well-suited for the representation of event sequences, 
using the time points as chaining arguments. Le is constructed such that chain 
rules can be learned from them. In other words, having chain rules in mind for
Lh, we transform the given data into facts that correspond to literals of Lh.
Note that this transformation is not motivated by the learning algorithm in
order to make learning simpler. The aim is to design Le such that an Lh can be
learned which can easily be embedded in the robot application.

4 Comprehensibility 

It is often claimed that learning results are easier to understand than statisti- 
cal results which demand a human interpreter who translates the results into 
conclusions for the customer of the statistical study. In particular, decision trees 
and rules were found to be easily understandable. However, these statements lack
justification. 

In a user-independent way, proposed criteria for comprehensibility most often
refer to the length of a description (the minimal description length (MDL)
principle). For decision trees, the number of nodes was used as a guideline
for comprehensibility. For logic programs, the number of literals of a clause or
the number of variables was used as an operational criterion for the ease of
understanding. Irene Stahl corrected the MDL for ILP by restricting the
description length of examples to that of positive examples only. Although
presented as an operationalisation of comprehensibility, compression may lead
to almost incomprehensible descriptions³. Its value for hypothesis testing and
structuring the hypothesis space notwithstanding, it is questionable as a means
to achieve comprehensible learning results.

Ryszard Michalski has proposed to use natural language for communicating 
learning results. The ease of transforming Lh into natural language can then be 
used as a criterion for the naturalness of the representation. 

Edgar Sommer introduced the notion of extensional redundancy. Removing 
extensional redundancy compresses a theory, but compressions are not restricted 
to that. It is known from psychology that some redundancy eases understanding. 
Extensional redundancy aims at characterizing superfluous parts of a theory.

Extensional redundancy: Let G be the set of goal concepts in a theory T. 
Let Q be the set of instances of goal concepts that are derivable from T. Let 
C be a clause in T and L be a literal in C. 

2 The reconstruction of the examples and background knowledge from the logic theory
and the example encoding by a reference Turing machine leads to a measure of
compression: the length of the output tape minus the length of the input tape.

3 Think of compressed text files!
A literal L is extensionally redundant in C with respect to $g \in G$, iff C is in
the derivation of g and $C \setminus \{L\}$.

A clause C is extensionally redundant in T with respect to Q, iff $T \vdash Q$ and
$T \setminus \{C\} \vdash Q$.

Structure is a key to comprehensibility. Structure is achieved by the folding
operation, which leaves the minimal Herbrand model of a theory unchanged.

fold: Let $C, D \in T_i$ be ordered clauses of the form
$C = C_0 \leftarrow C_1, \ldots, C_m, C_{m+1}, \ldots, C_{m+n}$ and
$D = D_0 \leftarrow D_1, \ldots, D_m$.

Let σ be a substitution satisfying the following conditions:

1. $C_i = D_i\sigma$, $i = 1, \ldots, m$

2. let $X_1, \ldots, X_l$ be the variables that occur in $D_{Body}$ but not in $D_{Head}$; each
$X_j\sigma$, $j = 1, \ldots, l$, does not occur in $C_{Head}$ nor in $C_{m+1}, \ldots, C_{m+n}$; if $i \neq j$,
then $X_i\sigma \neq X_j\sigma$

3. D is the only clause in $T_i$ whose head is unifiable with $D_0\sigma$.

If such a substitution exists, then

$C' = C_0 \leftarrow D_0\sigma, C_{m+1}, \ldots, C_{m+n}$ and

$T_{i+1} = fold(T_i, C, D) = T_i \setminus \{C\} \cup \{C'\}$.

Otherwise, $T_i$ remains unchanged.

Several stratification operators have been developed that structure a theory. 
The FENDER program folds clauses that define a target concept with an
intermediate concept. The intermediate concept is made of common partial
premises. A common partial premise is a set of literals which most frequently 
occur together in the given theory. The literals must share a variable. This will 
be omitted in the folded clause. However, not all logically unnecessary variables 
are omitted. This is meant to not hide relevant information from the user, but 
only encapsulate internal details. 

An alternative stratification method named prefix elimination has been
developed particularly for chain programs, i.e. for better communicating event
sequences. As opposed to FENDER, which gathers common partial premises
regardless of an ordering of literals, the prefix elimination method only replaces
common prefixes of chained literals in a clause’s body. Hence, applying prefix
elimination to a chain program outputs a chain program. The prefix elimination 
method looks for common literal sequences in all clauses, not only ones that are 
used to define the same concept. It suppresses all unnecessary variables. This is 
meant to exhibit the time relations more clearly. An abstracted example from 
our robot application may illustrate the method. 

Example: Let the theory Ti be

C1 = alongWall(S, X, Z)
     ← stand(V, O, X, Y1), stable(V, O, Y1, Y2), decrPeak(S, O, Y2, Z)

C2 = alongDoor(S, X, Z)
     ← stand(V, O, X, Y1), stable(V, O, Y1, Y2), incrPeak(S, V, Y2, Z)

Prefix elimination returns Ti+1:

C1' = alongWall(S, X, Z) ← standstable(V, O, X, Y2), decrPeak(S, O, Y2, Z)
C2' = alongDoor(S, X, Z) ← standstable(V, O, X, Y2), incrPeak(S, V, Y2, Z)
D   = standstable(V, O, X, Y2) ← stand(V, O, X, Y1), stable(V, O, Y1, Y2)

Two time intervals are summarized, but no information is abstracted away. 
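
The example can be replayed with a small sketch of folding a common body prefix. It covers only the special case in which the substitution σ is the identity, and it does not check conditions 2 and 3 of the fold definition; the helper names `common_prefix` and `fold_prefix` are hypothetical.

```python
def common_prefix(bodies):
    """Longest sequence of literals with which all given bodies start."""
    prefix = []
    for literals in zip(*bodies):
        if any(lit != literals[0] for lit in literals):
            break
        prefix.append(literals[0])
    return prefix


def fold_prefix(clause, definition):
    """Replace the body prefix of clause by the head of definition, if it matches."""
    head, body = clause
    d_head, d_body = definition
    n = len(d_body)
    if body[:n] != d_body:
        return clause  # not applicable, the clause remains unchanged
    return (head, [d_head] + body[n:])


stand = ("stand", ("V", "O", "X", "Y1"))
stable = ("stable", ("V", "O", "Y1", "Y2"))
c1 = (("alongWall", ("S", "X", "Z")), [stand, stable, ("decrPeak", ("S", "O", "Y2", "Z"))])
c2 = (("alongDoor", ("S", "X", "Z")), [stand, stable, ("incrPeak", ("S", "V", "Y2", "Z"))])

prefix = common_prefix([c1[1], c2[1]])
# The new head keeps only the chaining arguments needed outside the prefix.
d = (("standstable", ("V", "O", "X", "Y2")), prefix)
folded = [fold_prefix(c, d) for c in (c1, c2)]
print(folded[0])
print(folded[1])
```

The name and arguments of the intermediate predicate are given by hand here; choosing them automatically is part of what the actual method has to do.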

Compression and stratification are two operational criteria for the compre- 
hensibility of logical theories. Whether they correspond to true needs of users 
should be studied empirically. Since the representation is to be understood by 
human users, the answer depends on general cognitive capabilities as well as on 
user-specific preferences which, in turn, depend on prior knowledge and training. 
Adjusting a representation to particular preferences is a learning task in its own 
right. A number of answer-set-equivalent representations could be presented to
the user, who selects a favorite one. The selections serve as positive examples
and a profile of the user is learned which guides further presentations.
Theoretical analysis of the user profiles could well lead to a refinement of compression
which excludes incomprehensibly compact theories.

5 Embeddedness or Program Optimization 

Optimizing learning results for their use by a performance system is easier than 
optimizing them for human understanding, since the requirements of the system 
are known. In many cases the requirements by users and performance system 
are conflicting. However, they can both start from the same learning result, 
if Lh has been chosen carefully. In our robotics application, we used the fact
that chain programs correspond to deterministic finite automata (DFA). Since general
chain programs, as opposed to elementary chain rules, are not equivalent to
context-free grammars, we cannot apply the known transformation between
elementary chain rules and context-free grammars.
The general chain rules can be translated into a context-free grammar, but a
transformed grammar cannot be uniquely translated back into general chain
rules. Hence, we may theoretically describe the learning results with reference
to context-free grammars, but this analysis cannot be put to use. It is possible,
however, to transform general chain rules into a DFA. Starting from a theory
which has been stratified by prefix elimination, the mapping is as follows [23]:

— A clause head becomes a state of the DFA. 

— A clause defining a prefix becomes a transition of the DFA. 

— The clause head of an overall goal with its substitutions becomes an output. 

The DFA is restricted to successful derivations. Compilation of learned theories 
consisting of 250 to 470 clauses took between 1 and 5 minutes CPU time [23].
A system using the learned and optimized knowledge was developed by Volker
Klingspor [22]. His system SHARC performs object recognition, planning,
and plan execution. The low-level steps of object recognition are performed in 
parallel, each sensor having its own process. Object recognition is performed by 
forward inference on optimized clauses using a marker passing strategy. Planning 
and plan execution is performed by backward inference on optimized clauses. The 
general chain rules are learned by GRDT, an ILP learner with declarative bias.
GRDT is not restricted to Lh being general chain rules. The top goals to
be learned were moving along a door and moving through a door. Guiding the 
mobile robot by a joy stick through and along doors in one environment resulted 
in the training data. The test was that the robot actually moved along or through 
other doors in a different but similar environment (i.e. the university building). 
No map was used. Many more experiments have been made using the simulation 
component of the PIONEER mobile robot. We were told before the project that
real-time behavior is impossible on the basis of logic programs, and that neuro-computing
or other numerical processing would be mandatory. However, the real time from
sending sensor measurements to SHARC until the robot’s reaction was in almost all cases
below 0.005 seconds and at most 0.006 seconds US). This shows that learned
clauses can navigate a robot in real time, if they are parallelized and optimized
with respect to their use. In contrast, applying the learned clauses in their most
comprehensible form is not fast enough for a robot application.

6 Conclusion 

This paper tries to show that theory need not be restricted to the central learning
step, leaving practitioners alone with the design and optimization of Le and
Lh. In contrast, developing operational indicators for the quality of a representation
asks for theoretical analysis. Concerning the criteria of learnability,
comprehensibility, and embeddedness, a brief overview of approaches towards 
well-based guidelines for the development and transformation of Le and Lh is 
given. The question of how to design an appropriate Lej from given data in 
Lei is particularly stressed. On one hand, the design of Lej is oriented towards 
making learning easier. On the other hand, the design of Lej is oriented towards 
an admissible Lh which can be optimized with respect to embeddedness. 

The tailoring of representations is illustrated by a robot application. 

— A general ILP learner with declarative bias, GRDT, outputs general chain 
clauses, because Le was made of ground chain facts. 

— The original numerical data from the robot were transformed into ground 
chain facts by a simple procedure which uses parameter adjustment as its 
learning method. 

— A theory can be optimized concerning comprehensibility by introducing in- 
termediate predicates, which are then used for folding. 

— A theory made of general chain rules can be optimized for real-time deductive 
inference using the prefix elimination method and the transformation into 
DFAs. 
Abstract. Boosting is a general method for improving the accuracy of
any given learning algorithm. Focusing primarily on the AdaBoost algo- 
rithm, we briefly survey theoretical work on boosting including analyses 
of AdaBoost’s training error and generalization error, connections be- 
tween boosting and game theory, methods of estimating probabilities 
using boosting, and extensions of AdaBoost for multiclass classification 
problems. Some empirical work and applications are also described. 



Background 

Boosting is a general method which attempts to “boost” the accuracy of any 
given learning algorithm. Kearns and Valiant pTH ETTI) were the first to pose the 
question of whether a “weak” learning algorithm which performs just slightly 
better than random guessing in Valiant’s PAC model M can be “boosted” 
into an arbitrarily accurate “strong” learning algorithm. Schapire m came up 
with the first provable polynomial-time boosting algorithm in 1989. A year later, 
Freund m developed a much more efficient boosting algorithm which, although 
optimal in a certain sense, nevertheless suffered from certain practical drawbacks. 
The first experiments with these early boosting algorithms were carried out by 
Drucker, Schapire and Simard uni on an OCR task. 



AdaBoost 

The AdaBoost algorithm, introduced in 1995 by Freund and Schapire [22], solved
many of the practical difficulties of the earlier boosting algorithms, and is the
focus of this paper. Pseudocode for AdaBoost is given in Fig. 1 in the slightly
generalized form given by Schapire and Singer m- The algorithm takes as input 
a training set $(x_1, y_1), \ldots, (x_m, y_m)$ where each $x_i$ belongs to some domain or
instance space $X$, and each label $y_i$ is in some label set $Y$. For most of this
paper, we assume $Y = \{-1, +1\}$; later, we discuss extensions to the multiclass
case. AdaBoost calls a given weak or base learning algorithm repeatedly in a
series of rounds $t = 1, \ldots, T$. One of the main ideas of the algorithm is to
maintain a distribution or set of weights over the training set. The weight of
this distribution on training example $i$ on round $t$ is denoted $D_t(i)$. Initially, all
weights are set equally, but on each round, the weights of incorrectly classified
Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$
Initialize $D_1(i) = 1/m$.

For $t = 1, \ldots, T$:

— Train weak learner using distribution $D_t$.

— Get weak hypothesis $h_t : X \to \mathbb{R}$.

— Choose $\alpha_t \in \mathbb{R}$.

— Update:

$$D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$$

where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).

Output the final hypothesis:

$$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right).$$

Fig. 1. The boosting algorithm AdaBoost.



examples are increased so that the weak learner is forced to focus on the hard 
examples in the training set. 

The weak learner’s job is to find a weak hypothesis $h_t : X \to \mathbb{R}$ appropriate
for the distribution $D_t$. In the simplest case, the range of each $h_t$ is binary, i.e.,
restricted to $\{-1, +1\}$; the weak learner’s job then is to minimize the error

$$\epsilon_t = \Pr_{i \sim D_t}\left[h_t(x_i) \neq y_i\right].$$

Once the weak hypothesis $h_t$ has been received, AdaBoost chooses a parameter
$\alpha_t \in \mathbb{R}$ which intuitively measures the importance that it assigns to $h_t$. In
the figure, we have deliberately left the choice of $\alpha_t$ unspecified. For binary $h_t$,
we typically set

$$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right). \qquad (1)$$

More on choosing $\alpha_t$ follows below. The distribution $D_t$ is then updated using
the rule shown in the figure. The final hypothesis $H$ is a weighted majority vote
of the $T$ weak hypotheses where $\alpha_t$ is the weight assigned to $h_t$.
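The procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the weak learner is assumed to be an exhaustive search over single-feature threshold stumps, ties in the final vote are broken toward +1, and the names `train_stump`, `stump_predict`, `adaboost`, `f` and `H` are hypothetical.

```python
import math

def stump_predict(stump, x):
    """A stump (j, thr, polarity) predicts `polarity` if x[j] > thr, else -polarity."""
    j, thr, polarity = stump
    return polarity if x[j] > thr else -polarity

def train_stump(X, y, D):
    """Assumed weak learner: return (weighted error, stump) minimizing error under D."""
    best = None
    for j in range(len(X[0])):
        values = sorted({x[j] for x in X})
        # Candidate thresholds: one below all values, then midpoints between neighbors.
        thresholds = [values[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(values, values[1:])]
        for thr in thresholds:
            for polarity in (+1, -1):
                stump = (j, thr, polarity)
                err = sum(d for x, yi, d in zip(X, y, D) if stump_predict(stump, x) != yi)
                if best is None or err < best[0]:
                    best = (err, stump)
    return best

def adaboost(X, y, T):
    """Run T rounds; return the list of (alpha_t, h_t) and the normalizers Z_t."""
    m = len(X)
    D = [1.0 / m] * m                              # D_1(i) = 1/m
    ensemble, Zs = [], []
    for _ in range(T):
        eps, stump = train_stump(X, y, D)
        eps = min(max(eps, 1e-10), 1.0 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1.0 - eps) / eps)  # Eq. (1)
        w = [d * math.exp(-alpha * yi * stump_predict(stump, x))
             for x, yi, d in zip(X, y, D)]
        Z = sum(w)                                 # normalization factor Z_t
        D = [wi / Z for wi in w]                   # D_{t+1}
        ensemble.append((alpha, stump))
        Zs.append(Z)
    return ensemble, Zs

def f(ensemble, x):
    """Weighted vote f(x) = sum_t alpha_t h_t(x)."""
    return sum(a * stump_predict(s, x) for a, s in ensemble)

def H(ensemble, x):
    """Final hypothesis sign(f(x)); a zero vote is mapped to +1 (an assumption)."""
    return 1 if f(ensemble, x) >= 0 else -1
```

On a one-dimensional sample labeled + + - - + +, no single stump is consistent with the data, but the weighted vote of three stumps is, which makes it a convenient toy case for tracing the weight updates by hand.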



Analyzing the Training Error 

The most basic theoretical property of AdaBoost concerns its ability to reduce 
the training error. Specifically, Schapire and Singer m, in generalizing a theorem 
of Freund and Schapire, show that the training error of the final hypothesis
is bounded as follows:

$$\frac{1}{m} \left|\{i : H(x_i) \neq y_i\}\right| \le \frac{1}{m} \sum_i \exp(-y_i f(x_i)) = \prod_t Z_t \qquad (2)$$

where $f(x) = \sum_t \alpha_t h_t(x)$ so that $H(x) = \mathrm{sign}(f(x))$. The inequality follows
from the fact that $e^{-y_i f(x_i)} \ge 1$ if $y_i \neq H(x_i)$. The equality can be proved
straightforwardly by unraveling the recursive definition of $D_t$.

Eq. (2) suggests that the training error can be reduced most rapidly (in a
greedy way) by choosing $\alpha_t$ and $h_t$ on each round to minimize

$$Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i)).$$

In the case of binary hypotheses, this leads to the choice of $\alpha_t$ given in Eq. (1)
and gives a bound on the training error of

$$\prod_t \left[2\sqrt{\epsilon_t(1 - \epsilon_t)}\right] = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\left(-2 \sum_t \gamma_t^2\right)$$

where $\epsilon_t = 1/2 - \gamma_t$. This bound was first proved by Freund and Schapire EH.
Thus, if each weak hypothesis is slightly better than random so that $\gamma_t$ is bounded
away from zero, then the training error drops exponentially fast. This bound,
combined with the bounds on generalization error given below, proves that AdaBoost
is indeed a boosting algorithm in the sense that it can efficiently convert
a weak learning algorithm (which can always generate a hypothesis with a weak
edge for any distribution) into a strong learning algorithm (which can generate
a hypothesis with an arbitrarily low error rate, given sufficient data).
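For binary $h_t$, the step from minimizing $Z_t$ to the stated choice of $\alpha_t$ and the resulting bound can be written out explicitly. The derivation below is a standard sketch supplied here for convenience; it is not spelled out in the text, and it uses only the definitions of $Z_t$, $\epsilon_t$ and $\gamma_t$ given above.

```latex
\begin{align*}
Z_t(\alpha)
  &= \sum_{i:\,h_t(x_i)=y_i} D_t(i)\,e^{-\alpha}
   + \sum_{i:\,h_t(x_i)\neq y_i} D_t(i)\,e^{\alpha}
   = (1-\epsilon_t)\,e^{-\alpha} + \epsilon_t\,e^{\alpha}, \\
\frac{dZ_t}{d\alpha}
  &= -(1-\epsilon_t)\,e^{-\alpha} + \epsilon_t\,e^{\alpha} = 0
   \;\Longrightarrow\;
   \alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}, \\
Z_t(\alpha_t)
  &= 2\sqrt{\epsilon_t(1-\epsilon_t)}
   = \sqrt{1-4\gamma_t^2}
   \le e^{-2\gamma_t^2}
   \qquad\text{(using } \epsilon_t = \tfrac12-\gamma_t \text{ and } 1-x \le e^{-x}\text{)}.
\end{align*}
```

Taking the product over $t$ and applying Eq. (2) gives the exponential bound on the training error.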

Eq. (2) points to the fact that, at heart, AdaBoost is a procedure for finding
a linear combination $f$ of weak hypotheses which attempts to minimize

$$\sum_i \exp(-y_i f(x_i)) = \sum_i \exp\left(-y_i \sum_t \alpha_t h_t(x_i)\right). \qquad (3)$$

Essentially, on each round, AdaBoost chooses $h_t$ (by calling the weak learner)
and then sets $\alpha_t$ to add one more term to the accumulating weighted sum of weak
hypotheses in such a way that the sum of exponentials above will be maximally
reduced. In other words, AdaBoost is doing a kind of steepest descent search to
minimize Eq. (3) where the search is constrained at each step to follow coordinate
directions (where we identify coordinates with the weights assigned to weak
hypotheses).

Schapire and Singer m discuss the choice of at and ht in the case that ht 
is real-valued (rather than binary). In this case, $h_t(x)$ can be interpreted as a
“confidence-rated prediction” in which the sign of $h_t(x)$ is the predicted label,
while the magnitude $|h_t(x)|$ gives a measure of confidence.
Generalization Error



Freund and Schapire [22] showed how to bound the generalization error of the
final hypothesis in terms of its training error, the size m of the sample, the 
VC-dimension d of the weak hypothesis space and the number of rounds T of 
boosting. Specifically, they used techniques from Baum and Haussler ^ to show 
that the generalization error, with high probability, is at most

$$\hat{\Pr}\left[H(x) \neq y\right] + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$$

where $\hat{\Pr}[\cdot]$ denotes empirical probability on the training sample. This bound
suggests that boosting will overfit if run for too many rounds, i.e., as T becomes 
large. In fact, this sometimes does happen. However, in early experiments, several 
authors 0H1E1 observed empirically that boosting often does not overfit, even 
when run for thousands of rounds. Moreover, it was observed that AdaBoost 
would sometimes continue to drive down the generalization error long after the 
training error had reached zero, clearly contradicting the spirit of the bound 
above. For instance, the left side of Fig. 2 shows the training and test curves of
running boosting on top of Quinlan’s C4.5 decision-tree learning algorithm m 
on the “letter” dataset. 

In response to these empirical findings, Schapire et al. m, following the 
work of Bartlett [5], gave an alternative analysis in terms of the margins of the
training examples. The margin of example $(x, y)$ is defined to be

$$\mathrm{margin}_f(x, y) = \frac{y \sum_t \alpha_t h_t(x)}{\sum_t |\alpha_t|}.$$

It is a number in $[-1, +1]$ which is positive if and only if $H$ correctly classifies
the example. Moreover, as before, the magnitude of the margin can be interpreted
as a measure of confidence in the prediction. Schapire et al. proved that
larger margins on the training set translate into a superior upper bound on the
generalization error. Specifically, the generalization error is at most

$$\hat{\Pr}\left[\mathrm{margin}_f(x, y) \le \theta\right] + \tilde{O}\left(\sqrt{\frac{d}{m\theta^2}}\right)$$

for any $\theta > 0$ with high probability. Note that this bound is entirely independent
of T, the number of rounds of boosting. In addition, Schapire et al. proved that 
boosting is particularly aggressive at reducing the margin (in a quantifiable 
sense) since it concentrates on the examples with the smallest margins (whether 
positive or negative). Boosting’s effect on the margins can be seen empirically, 
for instance, on the right side of Fig. 2, which shows the cumulative distribution
of margins of the training examples on the “letter” dataset. In this case, even 




Fig. 2. Error curves and the margin distribution graph for boosting C4.5 on 
the letter dataset as reported by Schapire et al. m- Left: the training and test 
error curves (lower and upper curves, respectively) of the combined classifier as 
a function of the number of rounds of boosting. The horizontal lines indicate the 
test error rate of the base classifier as well as the test error of the final combined 
classifier. Right: The cumulative distribution of margins of the training examples 
after 5, 100 and 1000 iterations, indicated by short-dashed, long-dashed (mostly 
hidden) and solid curves, respectively. 



after the training error reaches zero, boosting continues to increase the margins 
of the training examples effecting a corresponding drop in the test error. 
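The margin of each training example and the cumulative margin distribution plotted in Fig. 2 are straightforward to compute from the weights and the weak predictions. The sketch below is illustrative only: it assumes binary weak hypotheses whose predictions have been tabulated in advance, and the function names are hypothetical.

```python
def normalized_margins(alphas, predictions, labels):
    """Return the margin y * sum_t alpha_t h_t(x) / sum_t |alpha_t| of every example.

    predictions[t][i] is h_t(x_i) in {-1, +1}; labels[i] is y_i in {-1, +1}.
    """
    total = sum(abs(a) for a in alphas)
    margins = []
    for i, y in enumerate(labels):
        vote = sum(a * preds[i] for a, preds in zip(alphas, predictions))
        margins.append(y * vote / total)
    return margins

def margin_cdf(margins, theta):
    """Empirical fraction of examples whose margin is at most theta."""
    return sum(1 for mg in margins if mg <= theta) / len(margins)
```

Evaluating `margin_cdf` on a grid of `theta` values in [-1, +1] gives one curve of the kind shown on the right side of Fig. 2; the value at `theta = 0` counts the examples that are misclassified or tied.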

Attempts (not always successful) to use the insights gleaned from the theory 
of margins have been made by several authors BE3E21- In addition, the margin 
theory points to a strong connection between boosting and the support-vector 
machines of Vapnik and others gmugg which explicitly attempt to maximize 
the minimum margin. 



A Connection to Game Theory 



The behavior of AdaBoost can also be understood in a game-theoretic setting as 
explored by Freund and Schapire misi (see also Grove and Schuurmans m 
and Breiman [7]). In classical game theory, it is possible to put any two-person,
zero-sum game in the form of a matrix $M$. To play the game, one player chooses
a row $i$ and the other player chooses a column $j$. The loss to the row player
(which is the same as the payoff to the column player) is $M_{ij}$. More generally,
the two sides may play randomly, choosing distributions $P$ and $Q$ over rows or
columns, respectively. The expected loss then is $P^{\top} M Q$.

Boosting can be viewed as repeated play of a particular game matrix. Assume
that the weak hypotheses are binary, and let $\mathcal{H} = \{h_1, \ldots, h_n\}$ be the entire weak
hypothesis space (which we assume for now to be finite). For a fixed training set
$(x_1, y_1), \ldots, (x_m, y_m)$, the game matrix $M$ has $m$ rows and $n$ columns where

$$M_{ij} = \begin{cases} 1 & \text{if } h_j(x_i) = y_i \\ 0 & \text{otherwise.} \end{cases}$$
The row player now is the boosting algorithm, and the column player is the
weak learner. The boosting algorithm’s choice of a distribution Dt over training 
examples becomes a distribution P over rows of M, while the weak learner’s 
choice of a weak hypothesis ht becomes the choice of a column j of M. 
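As a concrete illustration of this correspondence, the sketch below builds the game matrix for a small, made-up hypothesis space and evaluates the expected loss for a distribution over examples and a single chosen hypothesis. The data, the three hypotheses and the function names are all assumptions made for the example.

```python
def game_matrix(hypotheses, X, y):
    """M[i][j] = 1 if hypothesis j classifies example i correctly, else 0."""
    return [[1 if h(x) == yi else 0 for h in hypotheses] for x, yi in zip(X, y)]

def expected_loss(P, M, Q):
    """Expected loss P^T M Q for mixed strategies P (rows) and Q (columns)."""
    return sum(P[i] * M[i][j] * Q[j]
               for i in range(len(P)) for j in range(len(Q)))

# A toy training set and a three-element weak hypothesis space (illustrative only).
X = [0, 1, 2]
y = [1, -1, 1]
hypotheses = [
    lambda x: 1,                     # always predict +1
    lambda x: -1,                    # always predict -1
    lambda x: 1 if x >= 1 else -1,   # threshold at 1
]
M = game_matrix(hypotheses, X, y)
```

When the column player picks a single hypothesis $h_j$, the quantity `expected_loss(P, M, Q)` with `Q` concentrated on column $j$ is the accuracy of $h_j$ under the distribution `P`, that is, one minus its weighted error. The booster, as row player, therefore prefers distributions under which every weak hypothesis does poorly.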

As an example of the connection between boosting and game theory, consider
von Neumann’s famous minmax theorem which states that

$$\max_Q \min_P P^{\top} M Q = \min_P \max_Q P^{\top} M Q$$

for any matrix $M$. When applied to the matrix just defined and reinterpreted
in the boosting setting, this can be shown to have the following meaning: If,
for any distribution over examples, there exists a weak hypothesis with error
at most $1/2 - \gamma$, then there exists a convex combination of weak hypotheses
with a margin of at least $2\gamma$ on all training examples. AdaBoost seeks to find
such a final hypothesis with high margin on all examples by combining many
weak hypotheses; so in a sense, the minmax theorem tells us that AdaBoost
at least has the potential for success since, given a “good” weak learner, there
must exist a good combination of weak hypotheses. Going much further, AdaBoost
can be shown to be a special case of a more general algorithm for playing
repeated games, or for approximately solving matrix games. This shows that,
asymptotically, the distribution over training examples as well as the weights
over weak hypotheses in the final hypothesis have game-theoretic interpretations
as approximate minmax or maxmin strategies.

Estimating Probabilities

Classification generally is the problem of predicting the label $y$ of an example $x$
with the intention of minimizing the probability of an incorrect prediction. However,
it is often useful to estimate the probability of a particular label. Recently,
Friedman, Hastie and Tibshirani m suggested a method for using the output of
AdaBoost to make reasonable estimates of such probabilities. Specifically, they
suggest using a logistic function, and estimating

$$\Pr_f\left[y = +1 \mid x\right] = \frac{e^{f(x)}}{e^{f(x)} + e^{-f(x)}} \qquad (4)$$

where, as usual, $f(x)$ is the weighted average of weak hypotheses produced by
AdaBoost. The rationale for this choice is the close connection between the log
loss (negative log likelihood) of such a model, namely,

$$\sum_i \ln\left(1 + e^{-2 y_i f(x_i)}\right) \qquad (5)$$

and the function which, we have already noted, AdaBoost attempts to minimize:

$$\sum_i e^{-y_i f(x_i)}.$$
Specifically, it can be verified that Eq. (5) is upper bounded by Eq. (3). In
addition, if we add the constant $1 - \ln 2$ to Eq. (5) (which does not affect its
minimization), then it can be verified that the resulting function and the one in
Eq. (3) have identical Taylor expansions around zero up to second order; thus,
their behavior near zero is very similar. Finally, it can be shown that, for any
distribution over pairs $(x, y)$, the expectations

$$E\left[\ln\left(1 + e^{-2yf(x)}\right)\right]$$

and

$$E\left[e^{-yf(x)}\right]$$

are minimized by the same function $f$, namely,

$$f(x) = \frac{1}{2} \ln\left(\frac{\Pr[y = +1 \mid x]}{\Pr[y = -1 \mid x]}\right).$$

Thus, for all these reasons, minimizing Eq. (3), as is done by AdaBoost, can
be viewed as a method of approximately minimizing the negative log likelihood
given in Eq. (5). Therefore, we may expect Eq. (4) to give a reasonable probability
estimate.

Friedman, Hastie and Tibshirani also make other connections between AdaBoost,
logistic regression and additive models.
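The relationships claimed in this section are easy to check numerically. The sketch below is a small illustration with hypothetical function names: it implements the logistic estimate, the per-example log loss and the per-example exponential loss as functions of $z = y f(x)$, so that the upper bound, the agreement near zero and the form of the common minimizer can each be tested directly.

```python
import math

def prob_positive(f_x):
    """Logistic estimate of Pr[y = +1 | x] from the AdaBoost score f(x)."""
    return math.exp(f_x) / (math.exp(f_x) + math.exp(-f_x))

def log_loss(z):
    """Per-example log loss ln(1 + exp(-2z)), where z = y * f(x)."""
    return math.log(1.0 + math.exp(-2.0 * z))

def exp_loss(z):
    """Per-example exponential loss exp(-z), where z = y * f(x)."""
    return math.exp(-z)

def optimal_score(p):
    """The common minimizer f(x) = (1/2) ln(p / (1 - p)), with p = Pr[y = +1 | x]."""
    return 0.5 * math.log(p / (1.0 - p))
```

Feeding `optimal_score(p)` back into `prob_positive` returns `p`, which is the sense in which the logistic transform of the AdaBoost score can be read as a probability estimate.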



Multiclass Classification 

There are several methods of extending AdaBoost to the multiclass case. The 
most straightforward generalization, called AdaBoost.M1, is adequate when
the weak learner is strong enough to achieve reasonably high accuracy, even 
on the hard distributions created by AdaBoost. However, this method fails if 
the weak learner cannot achieve at least 50% accuracy when run on these hard 
distributions. 

For the latter case, several more sophisticated methods have been developed. 
These generally work by reducing the multiclass problem to a larger binary 
problem. Schapire and Singer ’sEO] algorithm AdaBoost. MH works by creating a 
set of binary problems, for each example x and each possible label y, of the form: 
“For example x, is the correct label y or is it one of the other labels?” Freund 
and Schapire ’sP2j algorithm AdaBoost. M2 (which is a special case of Schapire 
and Singer ’s m AdaBoost. MR algorithm) instead creates binary problems, for 
each example x with correct label y and each incorrect label y' of the form: “For 
example x, is the correct label y or y'?” 

These methods require additional effort in the design of the weak learn- 
ing algorithm. A different technique m, which incorporates Dietterich and 
Bakiri’s m method of error-correcting output codes, achieves similar provable 
bounds to those of AdaBoost.MH and AdaBoost.M2, but can be used with
any weak learner which can handle simple, binary labeled data. Schapire and 
Singer m give yet another method of combining boosting with error-correcting 
output codes. 



Fig. 3. Comparison of C4.5 versus boosting stumps and boosting C4.5 on a set 
of 27 benchmark problems as reported by Freund and Schapire m- Each point 
in each scatterplot shows the test error rate of the two competing algorithms on 
a single benchmark. The y-coordinate of each point gives the test error rate (in 
percent) of C4.5 on the given benchmark, and the x-coordinate gives the error
rate of boosting stumps (left plot) or boosting C4.5 (right plot). All error rates 
have been averaged over multiple runs. 




Experiments and Applications 

Practically, AdaBoost has many advantages. It is fast, simple and easy to program.
It has no parameters to tune (except for the number of rounds T). It
requires no prior knowledge about the weak learner and so can be flexibly com- 
bined with any method for finding weak hypotheses. Finally, it comes with a 
set of theoretical guarantees given sufficient data and a weak learner that can 
reliably provide only moderately accurate weak hypotheses. This is a shift in 
mind set for the learning-system designer: instead of trying to design a learning 
algorithm that is accurate over the entire space, we can instead focus on finding 
weak learning algorithms that only need to be better than random. 

On the other hand, some caveats are certainly in order. The actual perfor- 
mance of boosting on a particular problem is clearly dependent on the data and 
the weak learner. Consistent with theory, boosting can fail to perform well given 
insufficient data, overly complex weak hypotheses or weak hypotheses which are 
too weak. Boosting seems to be especially susceptible to noise m ( more on this 
later) . 

AdaBoost has been tested empirically by many researchers, including jm II 21 
[0 EEl ^ For instance, Freund and Schapire |2Dj tested AdaBoost 
on a set of UCI benchmark datasets using C4. 5 |2S! as a weak learning 
algorithm, as well as an algorithm which finds the best “decision stump” or 
Fig. 4. Comparison of error rates for AdaBoost and four other text categorization
methods (naive Bayes, probabilistic TF-IDF, Rocchio and sleeping experts)
as reported by Schapire and Singer EH. The algorithms were tested on two text
corpora, Reuters newswire articles (left) and AP newswire headlines (right),
and with varying numbers of class labels as indicated on the x-axis of each
figure.



single-test decision tree. Some of the results of these experiments are shown in 
Fig. 3. As can be seen from this figure, even boosting the weak decision stumps
can usually give as good results as C4.5, while boosting C4.5 generally gives the 
decision-tree algorithm a significant improvement in performance. 

In another set of experiments, Schapire and Singer EH used boosting for text 
categorization tasks. For this work, weak hypotheses were used which test on the 
presence or absence of a word or phrase. Some results of these experiments comparing
AdaBoost to four other methods are shown in Fig. 4. In nearly all of these
experiments and for all of the performance measures tested, boosting performed
as well as or significantly better than the other methods tested. Boosting has also
been applied to text filtering m, “ ranking” problems m and classification 
problems arising in natural language processing DEZI. 

The final hypothesis produced by AdaBoost when used, for instance, with a 
decision-tree weak learning algorithm, can be extremely complex and difficult to 
comprehend. With greater care, a more human-understandable final hypothesis 
can be obtained using boosting. Cohen and Singer mg showed how to design 
a weak learning algorithm which, when combined with AdaBoost, results in 
a final hypothesis consisting of a relatively small set of rules similar to those 
generated by systems like RIPPER P]i IREP |2S| and C4.5rules P^j. Cohen and 
Singer’s system, called SLIPPER, is fast, accurate and produces quite compact 
rule sets. In other work, Freund and Mason P21 showed how to apply boosting to 
learn a generalization of decision trees called “alternating trees.” Their algorithm 
produces a single alternating tree rather than an ensemble of trees as would be 
obtained by running AdaBoost on top of a decision-tree learning algorithm. On 
the other hand, their learning algorithm achieves error rates comparable to those 
of a whole ensemble of trees. 
4:4/0.18,9/0.16 4:4/0.21,1/0.18

7:4/0.25,9/0.18 1:9/0.15,7/0.15 2:0/0.29,2/0.19 9:7/0.25,9/0.17

4:1/0.23,4/0.23 4:1/0.21,4/0.20 4:9/0.16,4/0.16 9:9/0.17,4/0.17

7:7/0.24,9/0.21 9:9/0.25,7/0.22 4:4/0.19,9/0.16 9:9/0.20,7/0.17


Fig. 5. A sample of the examples that have the largest weight on an OCR 
task as reported by Freund and Schapire m- These examples were chosen after 
4 rounds of boosting (top line), 12 rounds (middle) and 25 rounds (bottom).
Underneath each image is a line of the form $d{:}\ell_1/w_1, \ell_2/w_2$, where $d$ is the label
of the example, $\ell_1$ and $\ell_2$ are the labels that get the highest and second highest
vote from the combined hypothesis at that point in the run of the algorithm,
and $w_1$, $w_2$ are the corresponding normalized scores.



A nice property of AdaBoost is its ability to identify outliers, i.e., examples 
that are either mislabeled in the training data, or which are inherently ambiguous 
and hard to categorize. Because AdaBoost focuses its weight on the hardest 
examples, the examples with the highest weight often turn out to be outliers. 
An example of this phenomenon can be seen in Fig. 5, taken from an OCR
experiment conducted by Freund and Schapire EDI- 

When the number of outliers is very large, the emphasis placed on the hard 
examples can become detrimental to the performance of AdaBoost. This was 
demonstrated very convincingly by Dietterich m Friedman et al. m suggested 
a variant of AdaBoost, called “Gentle AdaBoost” which puts less emphasis on 
outliers. In recent work, Freund suggested another algorithm, called “Brown- 
Boost,” which takes a more radical approach that de-emphasizes outliers when 
it seems clear that they are “too hard” to classify correctly. This algorithm is 
an adaptive version of Freund ’s m “boost-by-majority” algorithm. This work, 
together with Schapire ’s |SH! work on “drifting games,” reveals some interesting
new relationships between boosting, Brownian motion, and repeated games
while raising many new open problems and directions for future research. 
Abstract. We are concerned with the problem of sequential prediction
using a given hypothesis class of continuously-many prediction strategies. 
An effective performance measure is the minimax relative cumulative loss 
(RCL), which is the minimum of the worst-case difference between the 
cumulative loss for any prediction algorithm and that for the best as- 
signment in a given hypothesis class. The purpose of this paper is to 
evaluate the minimax RCL for general continuous hypothesis classes un- 
der general losses. We first derive asymptotical upper and lower bounds 
on the minimax RCL to show that they match $(k/2c) \ln m$ within error of
$o(\ln m)$ where $k$ is the dimension of parameters for the hypothesis class,
m is the sample size, and c is the constant depending on the loss function. 
We thereby show that the cumulative loss attaining the minimax RCL 
asymptotically coincides with the extended stochastic complexity (ESC), 
which is an extension of Rissanen’s stochastic complexity (SC) into the 
decision-theoretic scenario. We further derive non-asymptotical upper 
bounds on the minimax RCL both for parametric and nonparametric 
hypothesis classes. We apply the analysis into the regression problem to 
derive the least worst-case cumulative loss bounds to date. 



1 Introduction 

1.1 Minimax Regret 

We start with the minimax regret analysis for the sequential stochastic prediction
problem. Let $\mathcal{Y}$ be an alphabet, which can be either discrete or continuous. We
first consider the simplest case where $\mathcal{Y}$ is finite. Observe a sequence $y_1, y_2, \cdots$
where each $y_t$ ($t = 1, 2, \cdots$) takes a value in $\mathcal{Y}$. A stochastic prediction algorithm
$A$ performs as follows: At each round $t = 1, 2, \cdots$, $A$ assigns a probability mass
function over $\mathcal{Y}$ based on the past sequence $y^{t-1} = y_1 \cdots y_{t-1}$. The probability
mass function can be written as a conditional probability $P(\cdot \mid y^{t-1})$. After the
assignment, $A$ receives an outcome $y_t$ and suffers a logarithmic loss defined by
$-\ln P(y_t \mid y^{t-1})$. This process goes on sequentially. Note that $A$ is specified by
a sequence of conditional probabilities: $\{P(\cdot \mid y^{t-1}) : t = 1, 2, \cdots\}$. After observing
a sequence $y^m = y_1 \cdots y_m$ of length $m$, $A$ suffers a cumulative logarithmic loss
$\sum_{t=1}^{m} \left(-\ln P(y_t \mid y^{t-1})\right)$ where $P(\cdot \mid y^0) = P_0(\cdot)$ is given.
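The protocol just described can be made concrete with a short sketch. The predictor used here, Laplace's add-one rule over the binary alphabet {0, 1}, is an assumption chosen only for illustration and does not come from the paper; the function names are likewise hypothetical.

```python
import math

def laplace_predictor(past):
    """Return P(. | past) over {0, 1} using add-one smoothing (illustrative choice)."""
    ones = sum(past)
    p1 = (ones + 1) / (len(past) + 2)
    return {0: 1.0 - p1, 1: p1}

def cumulative_log_loss(predictor, sequence):
    """Sum over t of -ln P(y_t | y^{t-1}) for the given sequential predictor."""
    total = 0.0
    for t, y_t in enumerate(sequence):
        P = predictor(sequence[:t])   # probability mass function for round t
        total += -math.log(P[y_t])    # logarithmic loss on the revealed outcome
    return total
```

For the sequence 1, 1 this predictor assigns probabilities 1/2 and then 2/3 to the observed outcomes, so the cumulative loss is ln 2 + ln(3/2) = ln 3, which equals the negative logarithm of the joint probability 1/3 that the predictor implicitly assigns to the whole sequence.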



O. Watanabe, T. Yokomori (Eds.): ALT’99, LNAI 1720, pp. 26-|2^ 1999. 
The goal of stochastic prediction is to make the cumulative loss as small as
possible. We introduce a reference set of prediction algorithms, which we call
a hypothesis class, then evaluate the cumulative loss for any algorithm relative
to it. For sample size $m$, we define the worst-case regret for $A$ relative to a
hypothesis class $\mathcal{H}$ by

$$R_m(A : \mathcal{H}) = \sup_{y^m} \left( \sum_{t=1}^{m} \left(-\ln P(y_t \mid y^{t-1})\right) - \inf_{f \in \mathcal{H}} \sum_{t=1}^{m} \left(-\ln f(y_t \mid y^{t-1})\right) \right),$$

which is the worst-case difference between the cumulative logarithmic
loss for $A$ and that for the best assignment of a single hypothesis in $\mathcal{H}$. Further
we define the minimax regret for sample size $m$ by

$$R_m(\mathcal{H}) = \inf_A R_m(A : \mathcal{H}),$$

where the infimum is taken over all stochastic prediction algorithms. Note that
in the minimax regret analysis we require no statistical assumption for the data-generation
mechanism, but consider the worst case with respect to sequences.

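Notice here that for any $m$, a stochastic prediction algorithm specifies a joint
probability mass function by

$$P(y^m) = \prod_{t=1}^{m} P(y_t \mid y^{t-1}). \qquad (1)$$

Thus the minimax regret is rewritten as

$$R_m(\mathcal{H}) = \inf_P \sup_{y^m} \ln \frac{\sup_{f \in \mathcal{H}} f(y^m)}{P(y^m)}.$$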

Shtarkov j0| showed that the minimax regret is attained by the joint probability 
mass function under the normalized maximum likelihood, defined as follows: 

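$$P(y^m) = \frac{\sup_{f \in \mathcal{H}} f(y^m)}{\sum_{y^m} \sup_{f \in \mathcal{H}} f(y^m)}$$

and then the minimax regret amounts to

$$R_m(\mathcal{H}) = \ln \sum_{y^m} \sup_{f \in \mathcal{H}} f(y^m). \qquad (2)$$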

Specifically consider the case where the joint distribution is given by a product
of probability mass functions belonging to a parametric hypothesis class
given by $\mathcal{H}_k = \{P_\theta(\cdot) : \theta \in \Theta\}$ where $\Theta$ is a $k$-dimensional compact set in $\mathbb{R}^k$.
Let $\hat\theta$ be the maximum likelihood estimator (m.l.e.) of $\theta$ from $y^m$ (i.e.,
$\hat\theta = \arg\max_{\theta \in \Theta} P_\theta(y^m)$ where $P_\theta(y^m) = \prod_{t=1}^{m} P_\theta(y_t)$). Rissanen jS| proved that
under the condition that the central limit theorem holds for m.l.e. uniformly
over $\Theta$, $R_m(\mathcal{H}_k)$ is asymptotically expanded as follows:

$$R_m(\mathcal{H}_k) = \frac{k}{2} \ln \frac{m}{2\pi} + \ln \int \sqrt{|I(\theta)|}\, d\theta + o(1), \qquad (3)$$
where $I(\theta) = \left(E_\theta\left[-\partial^2 \ln P_\theta(y) / \partial\theta_i \partial\theta_j\right]\right)_{i,j}$ denotes the Fisher information matrix
and $o(1)$ goes to zero uniformly with respect to $y^m$ as $m$ goes to infinity.
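For the Bernoulli class ($k = 1$), both Shtarkov's exact expression (2) and the asymptotic expansion (3) can be evaluated directly, which gives a quick sanity check on the formulas. The sketch below is illustrative and not part of the paper: it uses the standard facts that, for Bernoulli models, $I(\theta) = 1/(\theta(1-\theta))$ and $\int_0^1 \sqrt{I(\theta)}\, d\theta = \pi$, and it ignores the $o(1)$ term. The function names are hypothetical.

```python
import math

def bernoulli_minimax_regret(m):
    """Exact value of Eq. (2) for the Bernoulli class: ln sum_{y^m} sup_theta P_theta(y^m).

    Sequences with k ones share the maximized likelihood (k/m)^k (1 - k/m)^(m-k),
    so the sum over all 2^m sequences collapses to a sum over k. Intended for
    moderate m only, since the binomial coefficient is converted to a float.
    """
    total = 0.0
    for k in range(m + 1):
        theta = k / m
        ml = (theta ** k) * ((1.0 - theta) ** (m - k))   # 0.0 ** 0 == 1.0 in Python
        total += math.comb(m, k) * ml
    return math.log(total)

def asymptotic_regret(m, k=1, log_fisher_integral=math.log(math.pi)):
    """Leading terms of Eq. (3): (k/2) ln(m / 2 pi) + ln int sqrt(|I(theta)|) d theta."""
    return (k / 2.0) * math.log(m / (2.0 * math.pi)) + log_fisher_integral
```

For $m = 1$ the exact value is $\ln 2$, since each of the two one-symbol sequences has maximized likelihood 1. For a few hundred symbols the two functions already agree to within a few hundredths of a nat.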

For a given sequence $y^m$ and $\mathcal{H}_k$, the negative log-likelihood for $y^m$ under
the joint probability mass function that attains the minimax regret is called the
stochastic complexity (SC) of $y^m$ (relative to $\mathcal{H}_k$) 0,|Ij,|3|i which we denote as
$SC(y^m)$. That is,

$$SC(y^m) = -\ln P_{\hat\theta}(y^m) + \frac{k}{2} \ln \frac{m}{2\pi} + \ln \int \sqrt{|I(\theta)|}\, d\theta + o(1). \qquad (4)$$

The SC of $y^m$ can be thought of as an extension of Shannon’s information in the
sense that the latter measures the information of a data sequence relative to a 
single hypothesis while the former does so relative to a class of hypotheses. 

1.2 Purpose of This Paper and Overview of Results 

We extend the minimax regret analysis in two ways. One is to extend it into 
a general decision-theoretic scenario in which the hypothesis class is a class 
of real-valued functions rather than a class of probability mass functions, and 
the prediction loss is measured in terms of general loss functions (e.g., square
loss, Hellinger loss) rather than the logarithmic loss. It is motivated by the fact
that in real problems such as on-line regression and pattern-recognition, predic- 
tion should be made deterministically and a variety of loss functions should be 
used as a distortion measure for prediction. 

We analyze the minimax relative cumulative loss (RCL), which is an exten- 
sion of the minimax regret, under general losses. The minimax RCL has been 
investigated in the community of computational learning theory, but most of
the work is restricted to specific cases: 1) the case where general loss functions
are used but a hypothesis class is finite (e.g., ^0], |2]) and 2) the case where a 
hypothesis class is continuous but only specific loss functions such as the square 
loss and the logarithmic loss are used (e.g., n, 0,0,0,1111), or only specific hy- 
pothesis classes such as the Bernoulli model and the linear regression model (e.g., 
0,1111) are used. This paper offers a universal method of minimax RCL analysis 
relative to general continuous hypothesis classes under general loss functions. 

We first derive asymptotical upper and lower bounds on the minimax RCL 
to show that they match $(k/2c) \ln m$ within error of $o(\ln m)$ where $k$ is the
dimension of parameters for the hypothesis class, m is the sample size, and c is 
the constant depending on the loss function. According to H2|, we introduce the 
extended stochastic complexity (ESC) to show that the ESC is an approximation 
to the minimax solution to RCL. The relation between ESC and the minimax 
RCL is an analogue of that between SC and the minimax regret. This gives a 
unifying view of the minimax RCL/the minimax regret analysis. 

The other way of extension is to introduce a method of non-asymptotical 
analysis, while (0 is an asymptotical result, that is, it is effective only when the 
sample size is sufficiently large. Since sample size is not necessarily large enough 
in real situations, theoretical bounds that hold for any sample size would be 
practically useful.
Non-asymptotical bounds on the minimax regret and RCL were derived in
pj],i2i,[ni for continuous hypothesis classes under specific losses. We take a 
new approach to derive non-asymptotical bounds on the minimax RCL both for 
parametric and non-parametric hypothesis classes under general losses. 

The rest of this paper is organized as follows: Section 2 gives a formal defini- 
tion of the minimax RCL. Section 3 overviews results on asymptotical analysis 
of the minimax RCL. Section 4 gives non-asymptotical analysis of the minimax 
RCL. Section 5 shows an application of our analysis to the regression problem. 



2 Minimax RCL

For a positive integer $n$, let $\mathcal{X}$ be a subset of $\mathbb{R}^n$, which we call the domain.
Let $\mathcal{Y} = \{0, 1\}$ or $\mathcal{Y} = [0, 1]$, which we call the range. Let $\mathcal{Z} = [0, 1]$ or let $\mathcal{Z}$ be a
set of probability mass functions over $\mathcal{Y}$. We call $\mathcal{Z}$ the decision space. We set
$\mathcal{D} = \mathcal{X} \times \mathcal{Y}$ and write an element in $\mathcal{D}$ as $D = (x, y)$. Let $L : \mathcal{Y} \times \mathcal{Z} \to \mathbb{R}^{+} \cup \{0\}$
be a loss function.

A sequential prediction algorithm $A$ performs as follows: At each round $t = 1, 2, \ldots$,
$A$ receives $x_t \in \mathcal{X}$, then outputs a predicted result $z_t \in \mathcal{Z}$ on the basis
of $D^{t-1} = D_1 \cdots D_{t-1}$, where $D_i = (x_i, y_i)$ $(i = 1, \ldots, t-1)$. Then $A$ receives the
correct outcome $y_t \in \mathcal{Y}$ and suffers a loss $L(y_t, z_t)$. Hence $A$ defines a sequence
of maps $\{f_t : t = 1, 2, \ldots\}$ where $f_t(x_t) = z_t$. A hypothesis class is a set of
sequential prediction algorithms.

Definition 1. For sample size $m$, for a hypothesis class $\mathcal{H}$, let $\mathcal{D}^m(\mathcal{H})$ be a
subset of $\mathcal{D}^m$ depending on $\mathcal{H}$. For any sequential prediction algorithm $A$, we
define the worst-case relative cumulative loss (RCL) by

$$R_m(A : \mathcal{H}) \stackrel{\mathrm{def}}{=} \sup_{D^m \in \mathcal{D}^m(\mathcal{H})} \left( \sum_{t=1}^{m} L(y_t, z_t) - \min_{f \in \mathcal{H}} \sum_{t=1}^{m} L(y_t, f_t(x_t)) \right),$$

where $z_t$ is the output of $A$ at the $t$-th round. We define the minimax RCL by

$$R_m(\mathcal{H}) \stackrel{\mathrm{def}}{=} \inf_{A} R_m(A : \mathcal{H}), \qquad (5)$$

where the infimum is taken over all sequential prediction algorithms. We assume
here that any prediction algorithm knows the sample size $m$ in advance.

Consider the special case where $\mathcal{X} = \emptyset$, $\mathcal{Y} = \{0, 1\}$, $\mathcal{Z}$ is the set of probability
mass functions over $\mathcal{Y}$, and the loss function is the logarithmic loss: $L(y, P) = -\ln P(y)$.
(Throughout this paper we call this case the probabilistic case.) We
can easily check that in the probabilistic case the minimax RCL (5) is equivalent
to the minimax regret.

Hereafter, we consider only the case where $\mathcal{Z} = [0, 1]$, that is, the prediction
is made deterministically. Below we give examples of loss functions for this case.

$$L(y, z) = (y - z)^2 \quad \text{(square loss)},$$

$$L(y, z) = y \ln \frac{y}{z} + (1 - y) \ln \frac{1 - y}{1 - z} \quad \text{(entropic loss)},$$

$$L(y, z) = \frac{1}{2}\left( (\sqrt{y} - \sqrt{z})^2 + (\sqrt{1 - y} - \sqrt{1 - z})^2 \right) \quad \text{(Hellinger loss)},$$

$$L(y, z) = \frac{1}{2}\left( -(2y - 1)(2z - 1) + \ln\left(e^{2z-1} + e^{-2z+1}\right) - B \right) \quad \text{(logistic loss)},$$

where $B = \ln(1 + e^{-2})$.

For a loss function $L$, we define $L_0(z)$ and $L_1(z)$ by $L_0(z) = L(0, z)$ and
$L_1(z) = L(1, z)$, respectively. We make the following assumption for $L$.

Assumption 1 The loss function $L$ satisfies:

1) $L_0(z)$ and $L_1(z)$ are twice continuously differentiable with respect to $z$. $L_0(0) = L_1(1) = 0$.
For any $0 < z < 1$, $L_0'(z) > 0$ and $L_1'(z) < 0$.

2) Define $\lambda^*$ by

$$\lambda^* \stackrel{\mathrm{def}}{=} \left( \sup_{0 < z < 1} \frac{L_0'(z) L_1'(z)^2 - L_1'(z) L_0'(z)^2}{L_0'(z) L_1''(z) - L_1'(z) L_0''(z)} \right)^{-1}. \qquad (6)$$

Then $0 < \lambda^* < \infty$.

3) Let $G(y, z, w) = \lambda^* (L(y, z) - L(y, w))$. For any $y, z, w \in [0, 1]$,
$\partial^2 G(y, z, w)/\partial y^2 + (\partial G(y, z, w)/\partial y)^2 \geq 0$.

For example, $\lambda^* = 1$ for the entropic loss, $\lambda^* = 2$ for the square loss, and
$\lambda^* = \sqrt{2}$ for the Hellinger loss. In the case of $\mathcal{Y} = \{0, 1\}$ instead of $\mathcal{Y} = [0, 1]$,
Condition 3) is not necessarily required.
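
The values of $\lambda^*$ quoted above can be checked numerically from definition (6). The following is a minimal sketch, not part of the original analysis: the helper name `lambda_star`, the uniform grid over (0, 1), and the hand-derived derivatives of $L_0$ and $L_1$ are all assumptions made for illustration.

```python
import math

def lambda_star(d0, d1, dd0, dd1, grid=2000):
    """Approximate lambda* = (sup_z ratio(z))^(-1) of (6) on a uniform grid in (0, 1).

    d0, d1   : first derivatives  L_0'(z),  L_1'(z)
    dd0, dd1 : second derivatives L_0''(z), L_1''(z)
    """
    sup_ratio = 0.0
    for i in range(1, grid):
        z = i / grid
        num = d0(z) * d1(z) ** 2 - d1(z) * d0(z) ** 2
        den = d0(z) * dd1(z) - d1(z) * dd0(z)
        sup_ratio = max(sup_ratio, num / den)
    return 1.0 / sup_ratio

# Square loss: L_0(z) = z^2, L_1(z) = (1 - z)^2.
square = lambda_star(lambda z: 2 * z, lambda z: -2 * (1 - z),
                     lambda z: 2.0, lambda z: 2.0)

# Entropic loss: L_0(z) = -ln(1 - z), L_1(z) = -ln z.
entropic = lambda_star(lambda z: 1 / (1 - z), lambda z: -1 / z,
                       lambda z: 1 / (1 - z) ** 2, lambda z: 1 / z ** 2)

# Hellinger loss: L_0(z) = 1 - sqrt(1 - z), L_1(z) = 1 - sqrt(z).
hellinger = lambda_star(lambda z: 0.5 / math.sqrt(1 - z), lambda z: -0.5 / math.sqrt(z),
                        lambda z: 0.25 * (1 - z) ** -1.5, lambda z: 0.25 * z ** -1.5)

print(square, entropic, hellinger)
```

Under these assumptions the three values come out close to 2, 1, and $\sqrt{2}$, in agreement with the examples stated after Assumption 1.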

3 Asymptotical Results 

According to we introduce the notion of ESC in order to derive upper 
bounds on the minimax RCL. 

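Definition 2. Let $\mu$ be a probability measure on a hypothesis class $\mathcal{H}$. For a
given loss function $L$, for a sequence $D^m \in \mathcal{D}^m$, we define the extended stochastic
complexity (ESC) of $D^m$ relative to $\mathcal{H}$ by

$$I(D^m : \mathcal{H}) \stackrel{\mathrm{def}}{=} -\frac{1}{\lambda^*} \ln \int \mu(df)\, e^{-\lambda^* \sum_{t=1}^{m} L(y_t, f_t(x_t))}. \qquad (7)$$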

While SC was defined as the cumulative logarithmic loss that attains the min- 
imax regret, ESC is not defined as the cumulative loss that attains the minimax 
RCL. This is because it is not possible to get an explicit form of the minimax 
solution to RCL for general losses. It will turn out in this section, however, that 
the ESC is a tight upper bound on the cumulative loss that attains the minimax 
RCL within error of $o(\ln m)$.

Lemma 1. Under Assumption 1, there exists a sequential prediction algorithm
such that for any $D^m$, its cumulative loss is upper-bounded by $I(D^m : \mathcal{H})$.

Lemma 1 can be proven using Vovk's aggregating algorithm [?] with its
analysis in [?].
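
To make Lemma 1 concrete, the sketch below runs an aggregating-type algorithm over a small finite hypothesis class under the square loss and compares its cumulative loss with the ESC computed for a uniform prior. It is an illustration under stated assumptions rather than the construction used in the paper: the finite class of five linear predictors, the random data, and the substitution rule $z = 1/2 + (g(0) - g(1))/2$ (the standard choice for the square loss with outcomes in $[0, 1]$) are all introduced here for the example.

```python
import math
import random

LAMBDA_STAR = 2.0  # lambda* for the square loss on [0, 1]

def square_loss(y, z):
    return (y - z) ** 2

def aggregate_predict(weights, preds):
    """One prediction of the aggregating algorithm for the square loss on [0, 1]."""
    def g(y):
        # Mixture ("generalized") loss at a hypothetical outcome y.
        mix = sum(w * math.exp(-LAMBDA_STAR * square_loss(y, p))
                  for w, p in zip(weights, preds))
        return -math.log(mix) / LAMBDA_STAR
    return 0.5 + (g(0.0) - g(1.0)) / 2.0  # substitution function

def run(experts, data):
    n = len(experts)
    weights = [1.0 / n] * n          # uniform prior over the finite class
    cum_expert = [0.0] * n
    cum_alg = 0.0
    for x, y in data:
        preds = [f(x) for f in experts]
        cum_alg += square_loss(y, aggregate_predict(weights, preds))
        for i, p in enumerate(preds):
            cum_expert[i] += square_loss(y, p)
        # Exponential weight update followed by normalization.
        unnorm = [w * math.exp(-LAMBDA_STAR * square_loss(y, p))
                  for w, p in zip(weights, preds)]
        total = sum(unnorm)
        weights = [u / total for u in unnorm]
    # ESC (7) for the uniform prior on the finite class.
    esc = -math.log(sum(math.exp(-LAMBDA_STAR * c) for c in cum_expert) / n) / LAMBDA_STAR
    return cum_alg, esc, min(cum_expert)

random.seed(0)
experts = [lambda x, th=th: th * x for th in (0.0, 0.25, 0.5, 0.75, 1.0)]
data = [(random.random(), random.random()) for _ in range(200)]
cum_alg, esc, best = run(experts, data)
print(cum_alg, esc, best)
```

In this setting the cumulative loss of the algorithm stays below the ESC, and the ESC itself lies between the best cumulative loss in the class and that loss plus $(1/\lambda^*)\ln N$ with $N = 5$.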

In order to make a connection between ESC and the minimax RCL, we focus
on the specific case where $\mathcal{H}$ is a parametric class such that any prediction
algorithm in $\mathcal{H}$ is written as a sequence of functions each of which belongs to a
parametric class $\mathcal{H}_k = \{f_\theta(\cdot) : \theta \in \Theta\}$ where $\Theta$ is a $k$-dimensional compact set
in $\mathbb{R}^k$. In this case the ESC of $D^m$ relative to $\mathcal{H}_k$ is written as

$$I(D^m : \mathcal{H}_k) = -\frac{1}{\lambda^*} \ln \int d\theta\, \pi(\theta)\, e^{-\lambda^* \sum_{t=1}^{m} L(y_t, f_\theta(x_t))}, \qquad (8)$$

where $\pi(\theta)$ is a prior probability density over $\Theta$.

We make the following assumption for $L$, $\mathcal{H}_k$, and $\pi$.



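Assumption 2 The following conditions hold for $L$, $\mathcal{H}_k$, and $\pi$:

1) Define a matrix $J(\theta)$ by

$$J_{i,j}(\theta) \stackrel{\mathrm{def}}{=} \frac{1}{m} \frac{\partial^2 \sum_{t=1}^{m} L(y_t, f_\theta(x_t))}{\partial \theta_i\, \partial \theta_j},$$

and let $\mu(D^m : \theta)$ be the largest eigenvalue of $J(\theta)$. Let $d = d_m$ be a sequence
such that $d_m > k$, $\lim_{m \to \infty} d_m = \infty$, and $\lim_{m \to \infty} (d_m/m) = 0$. Let $\hat\theta$ be the
minimum loss estimator of $\theta$ defined by $\hat\theta = \arg\min_{\theta \in \Theta} \sum_{t=1}^{m} L(y_t, f_\theta(x_t))$. Then
for some $0 < \mu < \infty$,

$$\sup_{D^m} \ \sup_{\theta : |\theta - \hat\theta| < (d_m/m)^{1/2}} \mu(D^m : \theta) < \mu.$$

2) Let $N_m = \{\theta \in \Theta : |\theta - \hat\theta| < (d_m/m)^{1/2}\}$ and $N'_m = \{\theta \in \mathbb{R}^k : |\theta - \hat\theta| < (d_m/m)^{1/2}\}$.
Then for some $0 < r < 1$, for all sufficiently large $m$, $\mathrm{vol}(N_m) > r \times \mathrm{vol}(N'_m)$,
where $\mathrm{vol}(S)$ is the Lebesgue volume of $S$.

3) For some $c > 0$, for any $\theta \in \Theta$, $\pi(\theta) > c$.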

The following lemma gives an asymptotical upper bound on the worst-case
ESC.

Lemma 2. [?] Let $\mathcal{D}^m(\Theta)$ be a subset of $\mathcal{D}^m$ such that
$\partial \sum_{t=1}^{m} L(y_t, f_\theta(x_t)) / \partial \theta \,|_{\theta = \hat\theta} = 0$. Under Assumption 2, for all $D^m \in \mathcal{D}^m(\Theta)$, we have

$$I(D^m : \mathcal{H}_k) \leq \min_{\theta \in \Theta} \sum_{t=1}^{m} L(y_t, f_\theta(x_t)) + \frac{k}{2\lambda^*} \ln \frac{m \lambda^* \mu}{2\pi} + \frac{1}{\lambda^*} \ln \frac{1}{rc} + o(1), \qquad (9)$$

where $\lim_{m \to \infty} o(1) = 0$ uniformly over $\mathcal{D}^m(\Theta)$.

Note that in the probabilistic case $\lambda^* = 1$, hence the right-hand side of (9)
coincides with the SC within error of $o(1)$.

The main technique to prove (9) is Laplace's method, which approximates
an integral using a Gaussian integral in the neighborhood of the minimum loss
estimator. The proof is sketched as follows:

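Proof. For sample size $m$, let $\delta_m = \sqrt{d_m/m}$, for $d_m$ as in Assumption 2. For a
minimum loss estimator $\hat\theta$ of $\theta$ from $D^m$, let $N_{\delta_m} = \{\theta \in \Theta : |\theta - \hat\theta| < \delta_m\}$ be
a neighborhood of $\hat\theta$. We denote $\sum_{t=1}^{m} L(y_t, f_\theta(x_t))$ as $L(D^m : f_\theta)$. Then for all
$D^m \in \mathcal{D}^m(\Theta)$, letting $\eta \in N_{\delta_m}$, we have

$$
\begin{aligned}
I(D^m : \mathcal{H}_k)
&\leq -\frac{1}{\lambda^*} \ln \int_{N_{\delta_m}} d\theta\, \pi(\theta) \exp\left(-\lambda^* L(D^m : f_\theta)\right) \\
&= -\frac{1}{\lambda^*} \ln \int_{N_{\delta_m}} d\theta\, \pi(\theta) \exp\left(-\lambda^* L(D^m : f_{\hat\theta}) - \frac{\lambda^* m}{2} (\theta - \hat\theta)^T J(\eta) (\theta - \hat\theta)\right) \\
&\leq -\frac{1}{\lambda^*} \ln \int_{N_{\delta_m}} d\theta\, \pi(\theta) \exp\left(-\lambda^* L(D^m : f_{\hat\theta}) - \frac{\lambda^* \mu m}{2} \sum_{i=1}^{k} (\theta_i - \hat\theta_i)^2\right) \\
&\leq L(D^m : f_{\hat\theta}) - \frac{1}{\lambda^*} \ln \pi(\theta^*) - \frac{1}{\lambda^*} \ln \int_{N_{\delta_m}} d\theta \exp\left(-\frac{\lambda^* \mu m}{2} \sum_{i=1}^{k} (\theta_i - \hat\theta_i)^2\right),
\end{aligned}
\qquad (10)
$$

where $\theta^*$ is the point in $N_{\delta_m}$ that attains the minimum of $\pi(\theta)$. The notations
of $J(\eta)$ and $\mu$ follow Assumption 2.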

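Let $N'_{\delta_m} = \{\theta \in \mathbb{R}^k : |\theta - \hat\theta| < \delta_m\}$. Then

$$
\int_{N_{\delta_m}} \exp\left(-\frac{\lambda^* \mu m}{2} \sum_{i=1}^{k} (\theta_i - \hat\theta_i)^2\right) d\theta
\geq r \int_{N'_{\delta_m}} \exp\left(-\frac{\lambda^* \mu m}{2} \sum_{i=1}^{k} (\theta_i - \hat\theta_i)^2\right) d\theta
\geq r\,(1 - \varepsilon_m) \left(\frac{2\pi}{\lambda^* \mu m}\right)^{k/2},
\qquad (11)
$$

where $\varepsilon_m \to 0$ as $d_m \to \infty$. See [?] for the last inequality. Plugging (11) into (10) and letting $d_m$ go to
infinity yield (9). □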

Combining Lemma 1 with Lemma 2 leads to the following asymptotical upper bound
on the minimax RCL.

Theorem 1. Under Assumptions 1 and 2,

$$R_m(\mathcal{H}_k) \leq \frac{k}{2\lambda^*} \ln \frac{m \lambda^* \mu}{2\pi} + \frac{1}{\lambda^*} \ln \frac{1}{rc} + o(1), \qquad (12)$$

where $\mathcal{D}^m(\mathcal{H}_k)$ as in Definition 1 is set to $\mathcal{D}^m(\Theta)$.

In order to investigate how tight (12) is, we derive an asymptotical lower
bound on the minimax RCL.

Theorem 2. [?] When $L$ is the entropic loss or the square loss, under some
regularity condition for $\mathcal{H}_k$,

$$R_m(\mathcal{H}_k) \geq \left(\frac{k}{2\lambda^*} - o(1)\right) \ln m. \qquad (13)$$

Furthermore, under some regularity conditions for $\mathcal{H}_k$ and a general $L$ in addition
to Assumptions 1 and 2, (13) holds.

(Note: The current forms of the regularity conditions required for general
losses are very complicated [?] and remain to be simplified.)

By Theorems 1 and 2 we have the following corollary relating the ESC to
the minimax RCL. It is formally summarized as follows:

Corollary 1. Let

$$\mathcal{I}_m(\mathcal{H}_k) \stackrel{\mathrm{def}}{=} \sup_{D^m \in \mathcal{D}^m(\Theta)} \left( I(D^m : \mathcal{H}_k) - \min_{\theta \in \Theta} \sum_{t=1}^{m} L(y_t, f_\theta(x_t)) \right).$$

Then under the conditions as in Theorems 1 and 2,

$$\lim_{m \to \infty} \frac{|\mathcal{I}_m(\mathcal{H}_k) - R_m(\mathcal{H}_k)|}{\ln m} = 0.$$

Corollary 1 shows that the ESC can be thought of as the cumulative loss
that attains the minimax RCL within error of $o(\ln m)$. It corresponds to the fact
that SC is the cumulative logarithmic loss that attains the minimax regret. This
gives a rationale for regarding ESC as a natural extension of SC.




4 Non-asymptotical Results 

4.1 Log-Loss Case 

Bounds (9), (12), and (13) are asymptotical in the sense that they are effective
only when the sample size $m$ is sufficiently large. This section derives
non-asymptotical upper bounds on the minimax RCL, which might not be tight
but hold for any sample size. First, we overview the results by Cesa-Bianchi and
Lugosi [?] for the probabilistic case under the logarithmic loss.

Let (S,d) be a metric space where S' is a class of joint probability mass 
functions each of which is decomposed as in m and d is the metric such that 

$$d(f, g) = \left( \sum_{t=1}^{m} d_t(f, g)^2 \right)^{1/2},$$

where $d_t(f, g) = \sup_{y^t} |\ln f(y_t | y^{t-1}) - \ln g(y_t | y^{t-1})|$.

For $T \subset S$, for $\varepsilon > 0$, let $N(T, \varepsilon)$ be the cardinality of the smallest subset
$T' \subset S$ such that for all $f \in T$, for some $g \in T'$, $d(f, g) \leq \varepsilon$.

The following theorem shows a non-asymptotical upper bound on the minimax
regret.

Theorem 3. [?] For any class $\mathcal{H}$,

$$R_m(\mathcal{H}) \leq \inf_{\varepsilon > 0} \left( \ln N(\mathcal{H}, \varepsilon) + 12 \int_0^{\varepsilon} \sqrt{\ln N(\mathcal{H}, \delta)}\, d\delta \right).$$

In deriving the above bound we do not require regularity conditions for $\mathcal{H}$ such
as the central limit theorem condition required for the asymptotical result to be satisfied. It actually
holds regardless of whether $\mathcal{H}$ is parametric or non-parametric.

Theorem 3 leads, as a special case, to the following non-asymptotical upper bound
on the minimax regret for parametric hypothesis classes.

Corollary 2. [?] Consider a class $\mathcal{H}$ such that for some positive constants $k$
and $c$, for all $\varepsilon > 0$,

$$\ln N(\mathcal{H}, \varepsilon) \leq k \ln\left(c\sqrt{m}/\varepsilon\right). \qquad (14)$$

Then for $m > (288 \ln(c\sqrt{m}))/(k c^2)$, we have

.X, c^ln(cCm) , 

< -Inm + -In - + 5k. (15) 

Note that Condition (14) holds for most classes parametrized smoothly by a
bounded subset of $\mathbb{R}^k$.

4.2 General-Loss Cases 

Next, for a decision-theoretic case under general losses other than the logarithmic
loss, we derive non-asymptotical upper bounds on the minimax RCL in a
different manner from Cesa-Bianchi and Lugosi's. First we investigate the case
where the hypothesis class is parametric.

Theorem 4. [?] Under Assumptions 1 and 2,

$$R_m(\mathcal{H}_k) \leq \frac{k}{2\lambda^*} \ln m + \frac{k}{2\lambda^*} \ln\left(\lambda^* \mu e F^2 V^{2/k}\right), \qquad (16)$$

where $\mathcal{D}^m(\mathcal{H}_k)$ as in Definition 1 is set to $\mathcal{D}^m(\Theta)$.

Proof. The main technique to prove (16) is to discretize the hypothesis class
with an appropriate size and then to apply the aggregating algorithm over the
discretized hypothesis class. The most important issue is how to choose the
number of discrete points.

For $F \geq 1$, let $\Theta_N$ be a subset of $\Theta$ whose size is $N$ and for which the
discretization scale is at most $F\sqrt{k}\,(V/N)^{1/k}$. That is,
$\sup_{\theta \in \Theta} \min_{\theta' \in \Theta_N} |\theta - \theta'| \leq F\sqrt{k}\,(V/N)^{1/k}$.
This means that the discretization scale for each component in
$\Theta_N$ is roughly uniform within a constant factor. It is a quite natural requirement
for the discretization.

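Let $\hat\theta = \arg\min_{\theta \in \Theta} \sum_{t=1}^{m} L(y_t, f_\theta(x_t))$ and $\bar\theta = \arg\min_{\theta \in \Theta_N} \sum_{t=1}^{m} L(y_t, f_\theta(x_t))$.
We write the relative cumulative loss for any algorithm $A$ as $R(A : D^m)$. Letting
$\hat y_t$ be the output of $A$ at the $t$-th round, we see

$$
R(A : D^m) = \left( \sum_{t=1}^{m} L(y_t, \hat y_t) - \sum_{t=1}^{m} L(y_t, f_{\bar\theta}(x_t)) \right)
+ \left( \sum_{t=1}^{m} L(y_t, f_{\bar\theta}(x_t)) - \sum_{t=1}^{m} L(y_t, f_{\hat\theta}(x_t)) \right).
\qquad (17)
$$

Letting $\bar{\mathcal{H}}_k = \{f_\theta(\cdot) : \theta \in \Theta_N\}$ and $A$ be the aggregating algorithm AG using
$\bar{\mathcal{H}}_k$, by Lemma 1 we see

$$
\sum_{t=1}^{m} L(y_t, \hat y_t) \leq I(D^m : \bar{\mathcal{H}}_k)
= -\frac{1}{\lambda^*} \ln \frac{1}{N} \sum_{\theta \in \Theta_N} e^{-\lambda^* \sum_{t=1}^{m} L(y_t, f_\theta(x_t))}
\leq \sum_{t=1}^{m} L(y_t, f_{\bar\theta}(x_t)) + \frac{1}{\lambda^*} \ln N.
$$

This leads to:

$$
\sup_{D^m} \left( \sum_{t=1}^{m} L(y_t, \hat y_t) - \sum_{t=1}^{m} L(y_t, f_{\bar\theta}(x_t)) \right) \leq \frac{1}{\lambda^*} \ln N. \qquad (18)
$$

By a Taylor expansion argument, for all $D^m \in \mathcal{D}^m(\Theta)$,

$$
\sum_{t=1}^{m} L(y_t, f_{\bar\theta}(x_t)) - \sum_{t=1}^{m} L(y_t, f_{\hat\theta}(x_t))
\leq \frac{\mu m}{2} |\bar\theta - \hat\theta|^2
\leq \frac{\mu m F^2 k}{2} \left(\frac{V}{N}\right)^{2/k}. \qquad (19)
$$

Plugging (18) and (19) into (17) yields

$$
\sup_{D^m \in \mathcal{D}^m(\Theta)} R(\mathrm{AG} : D^m) \leq \frac{1}{\lambda^*} \ln N + \frac{\mu m F^2 k}{2} \left(\frac{V}{N}\right)^{2/k}. \qquad (20)
$$

The minimum of (20) w.r.t. $N$ is attained by $N = \lceil (\lambda^* \mu m F^2)^{k/2} V \rceil$. Plugging
this optimal size into (20) and choosing $\Theta_N$ so that $F$ is smallest yield (16). □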
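
The optimization over $N$ in the last step can be checked numerically. The sketch below is illustrative only: it treats $N$ as a real number (ignoring the ceiling), and the numerical values chosen for $m$, $k$, $\lambda^*$, $\mu$, $F$, and $V$ are arbitrary assumptions, not values from the paper.

```python
import math

def discretization_bound(N, m, k, lam, mu, F, V):
    """Right-hand side of (20): (1/lam) ln N + (mu m F^2 k / 2) (V/N)^(2/k)."""
    return math.log(N) / lam + 0.5 * mu * m * F ** 2 * k * (V / N) ** (2.0 / k)

def optimal_size(m, k, lam, mu, F, V):
    """Real-valued minimizer of (20), i.e. (lam mu m F^2)^(k/2) V before rounding up."""
    return (lam * mu * m * F ** 2) ** (k / 2.0) * V

def closed_form(m, k, lam, mu, F, V):
    """(k / 2 lam) ln m + (k / 2 lam) ln(lam mu e F^2 V^(2/k)), the form of (16)."""
    return k / (2 * lam) * (math.log(m) + math.log(lam * mu * math.e * F ** 2 * V ** (2.0 / k)))

m, k, lam, mu, F, V = 1000, 3, 2.0, 6.0, 1.0, 0.5  # arbitrary illustrative values
n_opt = optimal_size(m, k, lam, mu, F, V)
at_opt = discretization_bound(n_opt, m, k, lam, mu, F, V)
print(n_opt, at_opt, closed_form(m, k, lam, mu, F, V))
```

At the real-valued minimizer the second term of (20) equals $k/(2\lambda^*)$, which is where the factor $e$ inside the logarithm of the closed form comes from.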



Next we investigate the case where the hypothesis class is nonparametric,
but can be approximated by a sequence of parametric hypothesis classes.

Theorem 5. Let $\{\mathcal{H}_k : k = 1, 2, \ldots\}$ be a sequence of classes such that
$\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots$, where $\mathcal{H}_k$ is a $k$-dimensional parametric hypothesis class. Let $\mathcal{F}$ be a
hypothesis class such that for some $C, \alpha > 0$,

$$\sup_{f \in \mathcal{F}} \inf_{h \in \mathcal{H}_k} \sup_{D^m} |L(D^m : f) - L(D^m : h)| / m \leq C / k^{\alpha}, \qquad (21)$$

where $L(D^m : f) = \sum_{t=1}^{m} L(y_t, f_t(x_t))$. Then under Assumptions 1 and 2, for
some $A, B > 0$ depending on $\lambda^*$, $C$, and $\alpha$,

1 

< Am^ In^ m + B In . 

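Proof. Fix $k$. By (16) and (21), we have

$$
R_m(\mathcal{F}) \leq R_m(\mathcal{H}_k) + \sup_{f \in \mathcal{F}} \inf_{h \in \mathcal{H}_k} \sup_{D^m} |L(D^m : f) - L(D^m : h)|
\leq \frac{k}{2\lambda^*} \ln m + \frac{k}{2\lambda^*} \ln\left(\lambda^* \mu e F^2 V^{2/k}\right) + \frac{mC}{k^{\alpha}}.
$$

Setting $k = (2\lambda^* \alpha C m / \ln m)^{1/(\alpha+1)}$ in the last inequality yields the bound of Theorem 5. □

Condition (21) means that the worst-case approximation error for $\mathcal{H}_k$ to $\mathcal{F}$
is $O(1/k^{\alpha})$. Eq. (16) is regarded as a special case of the bound of Theorem 5 where $\alpha$ is infinite.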

5 Minimax RCL for Regression 

We apply the results in Sections 3 and 4 to the regression problem. We consider
the case where the hypothesis class is a class of linear functions of a feature vector
and the distortion measure is the square loss. Such a case has been extensively
investigated in the linear regression (LR) scenario in statistics. Our analysis is
different from conventional ones in the following regards:

1) Although it is assumed in the classical LR that the noise is additive and is
generated according to a Gaussian distribution, we do not make any probabilistic
assumption either for a target distribution or a hypothesis class, but instead
perform worst-case analysis in terms of the worst-case RCL. Additionally, we
emphasize that we consider the regression problem in the on-line prediction
scenario rather than in the batch-learning scenario as in the classical LR.

2) While most algorithms investigated in the classical LR take linear forms of
a feature vector, we do not restrict ourselves to them, but rather consider
non-linear prediction algorithms using a hypothesis class of linear functions.

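Let $\mathcal{X} = \{x = (x^{(1)}, \ldots, x^{(k)}) \in [0, 1]^k : (x^{(1)})^2 + \cdots + (x^{(k)})^2 \leq 1\}$ and
$\mathcal{Y} = \mathcal{Z} = [0, 1]$. Let $\Theta = \{\theta = (\theta_1, \ldots, \theta_k) \in [0, 1]^k : \theta_1^2 + \cdots + \theta_k^2 \leq 1\}$. Let a
hypothesis class be

$$\mathcal{H}_k = \{f_\theta(x) = \theta^T x : x \in \mathcal{X},\ \theta \in \Theta\}.$$

This is known as a class of linear predictors. Let $L$ be the square loss function.
Then $\lambda^* = 2$. For $a > 0$, let $\pi(\theta) = e^{-a|\theta|^2}/R(\Theta)$ where $R(\Theta) = \int_\Theta e^{-a|\theta|^2} d\theta$.

Below we describe the aggregating algorithm [?] using $\mathcal{H}_k$, denoted as
AG. At the $t$-th round, AG takes as input $D^{t-1}$ and $x_t$, then outputs $\hat y_t$ s.t.

$$
\hat y_t = \frac{1}{2} - \frac{1}{4} \ln \frac{\int_\Theta p(\theta | D^{t-1})\, e^{-2(\theta^T x_t)^2}\, d\theta}{\int_\Theta p(\theta | D^{t-1})\, e^{-2(1 - \theta^T x_t)^2}\, d\theta},
$$

where

$$
p(\theta | D^{t-1}) = \frac{e^{-2(\theta - \mu_t)^T \Sigma_t (\theta - \mu_t)}}{\int_\Theta e^{-2(\theta - \mu_t)^T \Sigma_t (\theta - \mu_t)}\, d\theta},
\qquad \mu_t = \Sigma_t^{-1} A_t,
$$

$$
\Sigma_t^{(p,q)} = \frac{a}{2}\, \delta_{p,q} + \sum_{i=1}^{t-1} x_i^{(p)} x_i^{(q)},
\qquad A_t^{(p)} = \sum_{i=1}^{t-1} y_i\, x_i^{(p)},
$$

where $\Sigma_t^{(p,q)}$, $x_i^{(p)}$, and $A_t^{(p)}$ denote the $(p,q)$th component of $\Sigma_t$, the $p$th component
of $x_i$, and the $p$th component of $A_t$, respectively $(p, q = 1, \ldots, k)$, and $\delta_{p,q} = 1$
if $p = q$ and $\delta_{p,q} = 0$ if $p \neq q$.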

We can set $\mu = 2k$, $r = 1/2^k$, $c = e^{-a}/R(\Theta)$. By (9) we have the following
upper bound on the worst-case RCL for AG:

$$R_m(\mathrm{AG} : \mathcal{H}_k) \leq \frac{k}{4} \ln \frac{2mk}{\pi} + \frac{1}{2} \ln\left(2^k e^{a} R(\Theta)\right) + o(1). \qquad (22)$$

Note that the parameters in Theorem 4 are: $V = \pi^{k/2} / (2^k \Gamma(1 + k/2))$ and
$F = 1$. By (16), we have

$$R_m(\mathcal{H}_k) \leq \frac{k}{4} \ln m + \frac{k}{4} \ln\left(\frac{k e \pi}{\Gamma(1 + k/2)^{2/k}}\right). \qquad (23)$$

Vovk [?] derived a similar non-asymptotic upper bound on the worst-case
RCL for the aggregating algorithm using linear predictors. His bound matches
(23) up to the $\frac{k}{4} \ln m$ term.

Kivinen and Warmuth [?] proposed the gradient descent algorithm (GD) and
the exponentiated gradient algorithm (EG) as sequential prediction algorithms
using the linear predictors. Notice here that the outputs of both EG and GD are
linear in $x$, whereas that of AG is not linear in $x$. They showed

$$\sup R(\mathrm{GD} : D^m) = O(\sqrt{m}),$$

$$\sup R(\mathrm{EG} : D^m) = O\left(k \sqrt{m \ln k}\right).$$

Eq. (23) shows that the upper bound on the worst-case RCL for AG is $O(k \ln m)$,
which is smaller than those for GD and EG when $m$ is sufficiently large and the
parameter size $k$ is fixed.
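
To get a feel for the rates being compared, the following sketch evaluates a bound of the form $\frac{k}{4}\ln m + \frac{k}{4}\ln\left(k e \pi / \Gamma(1 + k/2)^{2/k}\right)$, which is how the non-asymptotic bound for linear predictors reads here, against $\sqrt{m}$. The function name `ag_bound` is hypothetical, and the constant hidden in the $O(\sqrt{m})$ rate is taken to be 1 purely for illustration.

```python
import math

def ag_bound(m, k):
    """(k/4) ln m + (k/4) ln(k e pi / Gamma(1 + k/2)^(2/k)), using lgamma for stability."""
    log_term = math.log(k * math.e * math.pi) - (2.0 / k) * math.lgamma(1 + k / 2.0)
    return k / 4.0 * (math.log(m) + log_term)

k = 5
for m in (10 ** 2, 10 ** 4, 10 ** 6):
    # Compare the logarithmic bound with a sqrt(m) rate (unit constant assumed).
    print(m, round(ag_bound(m, k), 3), round(math.sqrt(m), 3))
```

Squaring the sample size adds exactly $\frac{k}{4}\ln m$ to the bound, whereas a $\sqrt{m}$ rate is multiplied by $\sqrt{m}$.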
Abstract. This paper clarifies the learning efficiency of a non-regular parametric
model such as a neural network whose true parameter set is an
analytic variety with singular points. By using Sato's b-function we rigorously
prove that the free energy or the Bayesian stochastic complexity
is asymptotically equal to $\lambda_1 \log n - (m_1 - 1) \log\log n + \text{constant}$, where
$\lambda_1$ is a rational number, $m_1$ is a natural number, and $n$ is the number
of training samples. Also we show an algorithm to calculate $\lambda_1$ and $m_1$
based on the resolution of singularity. In regular models, $2\lambda_1$ is equal to
the number of parameters and $m_1 = 1$, whereas in non-regular models
such as neural networks, $2\lambda_1$ is smaller than the number of parameters
and $m_1 \geq 1$.



1 Introduction 

From the statistical point of view, layered learning machines such as neural
networks are not regular models. In a regular statistical model, the set of true
parameters consists of only one point and is identifiable even if the learning model
is larger than necessary to attain the true distribution (over-realizable case). On
the other hand, if a neural network is in the over-realizable case, the set of true
parameters is neither one point nor a manifold, but an analytic variety with singular
points. For such non-regular and non-identifiable learning machines, the
maximum likelihood estimator does not exist or is not subject to the asymptotically
normal distribution, with the result that their learning efficiency is not yet
clarified. However, analysis for the over-realizable case is necessary
for selecting the optimal model which balances the function approximation error
with the statistical estimation error [?].

In this paper, by employing algebraic analysis, we prove that the free energy
$F(n)$ (also called the Bayesian stochastic complexity or the average Bayesian
factor) has the asymptotic form

$$F(n) = \lambda_1 \log n - (m_1 - 1) \log\log n + O(1),$$

where $n$ is the number of empirical samples. We also show an algorithm
to calculate the positive rational number $\lambda_1$ and the natural number $m_1$ using
Hironaka's resolution of singularities, and that $2\lambda_1$ is smaller than the number
of parameters. Since the increase of the free energy $F(n + 1) - F(n)$ is equal
to the generalization error defined by the average Kullback distance of the estimated
probability density from the true one, our result claims that non-regular
statistical models such as neural networks are better learning machines than
the regular models if the Bayesian estimation is applied in learning.



2 Main Results 



Let $p(y|x, w)$ be a conditional probability density from an input vector $x \in \mathbb{R}^M$
to an output vector $y \in \mathbb{R}^N$ with a parameter $w \in \mathbb{R}^d$, which represents
probabilistic inference of a learning machine. Let $\varphi(w)$ be a probability density
function on the parameter space $\mathbb{R}^d$, whose support is denoted by $W = \mathrm{supp}\, \varphi \subset \mathbb{R}^d$.
We assume that training or empirical sample pairs $\{(x_i, y_i); i = 1, 2, \ldots, n\}$
are independently taken from $q(y|x) q(x)$, where $q(x)$ and $q(y|x)$ represent the
true input probability and the true inference probability, respectively. In the
Bayesian estimation, the estimated inference $p_n(y|x)$ is the average of the a
posteriori ensemble,

$$p_n(y|x) = \int p(y|x, w)\, \rho_n(w)\, dw,$$

$$\rho_n(w) = \frac{1}{Z_n}\, \varphi(w) \prod_{i=1}^{n} p(y_i | x_i, w),$$

where $Z_n$ is a constant which ensures $\int \rho_n(w)\, dw = 1$. The learning efficiency
or the generalization error is defined by the average Kullback distance of the
estimated probability density $p_n(y|x)$ from the true one $q(y|x)$,

$$K(n) = E_n\left\{ \int \log \frac{q(y|x)}{p_n(y|x)}\, q(x, y)\, dx\, dy \right\},$$

where $E_n\{\cdot\}$ shows the expectation value over all sets of training sample pairs.

In this paper we mainly consider the statistical estimation error and assume
that the model can attain the true inference, in other words, there exists a
parameter $w_0 \in W$ such that $p(y|x, w_0) = q(y|x)$. Let us define the average and
empirical loss functions.

$$f(w) = \int \log \frac{q(y|x)}{p(y|x, w)}\, q(y|x)\, q(x)\, dx\, dy,$$

$$f_n(w) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{q(y_i|x_i)}{p(y_i|x_i, w)}.$$

Note that $f(w) \geq 0$. By the assumption, the set of the true parameters
$W_0 = \{w \in W ; f(w) = 0\}$ is not an empty set. $W_0$ is called an analytic set or analytic
variety of the analytic function $f(w)$. Note that $W_0$ is not a manifold in general,
since it has singular points.

From these definitions, it is immediately proven [?] that the average Kullback
distance $K(n)$ is equal to the increase of the free energy $F(n)$:

$$K(n) = F(n + 1) - F(n),$$

$$F(n) = -E_n\left\{ \log \int \exp(-n f_n(w))\, \varphi(w)\, dw \right\},$$

where $F(n)$ is sometimes called the Bayesian stochastic complexity or the Bayesian
factor. Theorems 1 and 2 are the main results of this paper. Let $C_0^\infty$ be the set of all
compact support and $C^\infty$-class functions on $\mathbb{R}^d$.

Theorem 1 Assume that $f(w)$ is an analytic function and $\varphi(w)$ is a probability
density function on $\mathbb{R}^d$. Then, there exists a real constant $C_1$ such that for any
natural number $n$

$$F(n) \leq \lambda_1 \log n - (m_1 - 1) \log\log n + C_1, \qquad (1)$$

where the rational number $-\lambda_1$ ($\lambda_1 > 0$) and the natural number $m_1$ are the largest
pole and its multiplicity of the meromorphic function that is analytically continued
from

$$J(\lambda) = \int_{f(w) < \epsilon} f(w)^{\lambda}\, \psi(w)\, dw \quad (\mathrm{Re}(\lambda) > 0),$$

where $\epsilon > 0$ is a sufficiently small constant, and $\psi(w)$ is an arbitrary nonzero
$C_0^\infty$-class function that satisfies $0 \leq \psi(w) \leq \varphi(w)$.
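
As a toy illustration of how $\lambda_1$ and $m_1$ arise as the largest pole of $J(\lambda)$ and its multiplicity, consider a monomial $f(w) = \prod_j w_j^{2k_j}$ on $[0, 1]^d$ with $\psi \equiv 1$ on the cube. In that case $J(\lambda) = \prod_j 1/(2k_j\lambda + 1)$, so the poles sit at $\lambda = -1/(2k_j)$. This setting is an assumption made only for the example (in particular it ignores the smoothness requirement on $\psi$), and the helper name `largest_pole` is hypothetical.

```python
from fractions import Fraction

def largest_pole(exponents):
    """Return (lambda_1, m_1) for f(w) = prod_j w_j^(2 k_j) on [0, 1]^d with psi = 1.

    J(lambda) = prod_j 1 / (2 k_j lambda + 1), so every k_j > 0 contributes a pole
    at -1/(2 k_j); the largest pole is the one closest to zero.
    """
    candidates = [Fraction(1, 2 * k) for k in exponents if k > 0]
    lam1 = min(candidates)
    m1 = candidates.count(lam1)
    return lam1, m1

print(largest_pole([1]))      # f(w) = w^2 in one dimension (regular case)
print(largest_pole([1, 1]))   # f(w) = w1^2 w2^2 (singular at the origin)
print(largest_pole([1, 2]))   # f(w) = w1^2 w2^4
```

For $f(w) = w^2$ in one dimension this gives $\lambda_1 = 1/2 = d/2$ with $m_1 = 1$, while for $f(w) = w_1^2 w_2^2$ it gives $\lambda_1 = 1/2 < d/2$ with $m_1 = 2$.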



Theorem 2 Let $\sigma > 0$ be a constant value. Assume that

$$p(y|x, w) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{\|y - \psi(x, w)\|^2}{2\sigma^2}\right),$$

where $\psi(x, w)$ is an analytic function for $w \in \mathbb{R}^d$ and a continuous function
for $x \in \mathbb{R}^M$. Also assume that $\varphi(w)$ is a $C_0^\infty$-class probability density function.
Then, there exists a constant $C_2 > 0$ such that for any natural number $n$

$$|F(n) - \lambda_1 \log n + (m_1 - 1) \log\log n| \leq C_2,$$

where the rational number $-\lambda_1$ ($\lambda_1 > 0$) and the natural number $m_1$ are the largest
pole and its multiplicity of the meromorphic function that is analytically continued
from

$$J(\lambda) = \int_{f(w) < \epsilon} f(w)^{\lambda}\, \varphi(w)\, dw \quad (\mathrm{Re}(\lambda) > 0),$$

where $\epsilon > 0$ is a sufficiently small constant.



From Theorem 2, if the average Kullback distance $K(n)$ has an asymptotic
expansion, then

$$K(n) = \frac{\lambda_1}{n} - \frac{m_1 - 1}{n \log n} + o\left(\frac{1}{n \log n}\right).$$

As is well known, regular statistical models have $\lambda_1 = d/2$ and $m_1 = 1$. However,
non-regular models such as neural networks have smaller $\lambda_1$ and larger $m_1$ in
general [?].
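
The passage from the free energy to the generalization error is a discrete derivative, which can be checked numerically on the leading terms. The sketch below is only an illustration: it drops the $O(1)$ term of $F(n)$, and the values $n = 10^4$, $\lambda_1 = 1/2$, $m_1 = 2$ are arbitrary choices.

```python
import math

def free_energy_leading(n, lam1, m1):
    """Leading terms of F(n): lam1 * log n - (m1 - 1) * log log n (O(1) term dropped)."""
    return lam1 * math.log(n) - (m1 - 1) * math.log(math.log(n))

def generalization_error_leading(n, lam1, m1):
    """Leading terms of K(n): lam1 / n - (m1 - 1) / (n log n)."""
    return lam1 / n - (m1 - 1) / (n * math.log(n))

n, lam1, m1 = 10_000, 0.5, 2
increment = free_energy_leading(n + 1, lam1, m1) - free_energy_leading(n, lam1, m1)
print(increment, generalization_error_leading(n, lam1, m1))
```

With the same $\lambda_1$, a larger multiplicity $m_1$ lowers the leading part of $K(n)$, which is the sense in which the text compares non-regular and regular models.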
3 Proof of Theorem 1

Lemma 1 The upper bound of the free energy is given by

$$F(n) \leq -\log \int \exp(-n f(w))\, \varphi(w)\, dw.$$

[Proof of Lemma 1] From Jensen's inequality and $E_n(f_n - f) = 0$, Lemma 1 is
obtained. (Q.E.D.)

For a given $\epsilon > 0$, a set of parameters is defined by

$$W_\epsilon = \{w \in W = \mathrm{supp}\, \varphi ; f(w) < \epsilon\}.$$


Theorem 3 (Sato, Bernstein, Björk, Kashiwara) Assume that there exists $\epsilon_0 > 0$
such that $f(w)$ is an analytic function in $W_{\epsilon_0}$. Then there exists a set $(\epsilon, P, b)$,
where

(1) $\epsilon < \epsilon_0$ is a positive constant,

(2) $P = P(\lambda, w, \partial_w)$ is a differential operator which is a polynomial in $\lambda$, and

(3) $b(\lambda)$ is a polynomial, such that

$$P(\lambda, w, \partial_w)\, f(w)^{\lambda + 1} = b(\lambda)\, f(w)^{\lambda} \quad (\forall w \in W_\epsilon,\ \forall \lambda \in \mathbb{C}).$$

The zeros of the equation $b(\lambda) = 0$ are real, rational, and negative numbers.

[Explanation of Theorem] This theorem is proven based on the algebraic property
of the ring of partial differential operators. See references [?]. The rationality
of the zeros of $b(\lambda) = 0$ is shown based on the resolution of singularity [?].
The smallest order polynomial $b(\lambda)$ that satisfies the above relation is called a
Sato-Bernstein polynomial or a b-function.
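
As a simple worked instance of the relation in Theorem 3 (an illustration chosen here, not an example taken from the paper), take $d = 1$ and $f(w) = w^2$. Differentiating twice gives the operator and the polynomial explicitly:

```latex
\[
  \partial_w^2\, (w^2)^{\lambda+1}
  = \partial_w^2\, w^{2\lambda+2}
  = (2\lambda+2)(2\lambda+1)\, w^{2\lambda}
  = 4(\lambda+1)\left(\lambda+\tfrac{1}{2}\right) (w^2)^{\lambda},
\]
\[
  \text{so that}\qquad
  P = \tfrac{1}{4}\,\partial_w^2,
  \qquad
  b(\lambda) = (\lambda+1)\left(\lambda+\tfrac{1}{2}\right).
\]
```

The zeros $-1$ and $-1/2$ are real, rational, and negative, as the theorem states, and the zero closest to the origin matches the pole of $\int_0^1 w^{2\lambda}\, dw = 1/(2\lambda + 1)$ at $\lambda = -1/2$.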

Hereafter $\epsilon > 0$ is taken smaller than that in this theorem. We can assume that
$\epsilon < 1$ without loss of generality. For a given analytic function $f(w)$, let us define
a complex function $J(\lambda)$ of $\lambda \in \mathbb{C}$ by

$$J(\lambda) = \int_{W_\epsilon} f(w)^{\lambda}\, \varphi(w)\, dw.$$

Lemma 2 Assume that $\varphi(w)$ is a $C_0^\infty$-class function. Then, $J(\lambda)$ can be
analytically continued to a meromorphic function on the entire complex plane, in
other words, $J(\lambda)$ has only poles in $|\lambda| < \infty$. Moreover $J(\lambda)$ satisfies the following
conditions.

(1) The poles of $J(\lambda)$ are rational, real, and negative numbers.

(2) For an arbitrary $a \in \mathbb{R}$, $J(\infty + a\sqrt{-1}) = 0$, and $J(a \pm \infty \cdot \sqrt{-1}) = 0$.

[Proof of Lemma 2] $J(\lambda)$ is an analytic function in the region $\mathrm{Re}(\lambda) > 0$.
$J(\infty + a\sqrt{-1}) = 0$ is shown by Lebesgue's convergence theorem. For $a > 0$,
$J(a + \infty\sqrt{-1}) = 0$ is shown by the Riemann-Lebesgue theorem. Let $P^*$ be the
adjoint operator of $P = P(\lambda, w, \partial_w)$. Then, by Theorem 3,

$$J(\lambda) = \frac{1}{b(\lambda)} \int_{W_\epsilon} P f(w)^{\lambda + 1}\, \varphi(w)\, dw = \frac{1}{b(\lambda)} \int_{W_\epsilon} f(w)^{\lambda + 1}\, P^* \varphi(w)\, dw.$$

Because $P^* \varphi \in C_0^\infty$, $J(\lambda)$ can be analytically continued to $J(\lambda - 1)$ if $b(\lambda) \neq 0$.
By analytic continuation, even for $a < 0$, $J(a + \infty\sqrt{-1}) = 0$. If $b(\lambda) = 0$,
then such $\lambda$ is a pole which is on the negative part of the real axis. (Q.E.D.)

Definition Poles of the function $J(\lambda)$ are on the negative part of the real axis
and contained in the set $\{m + \nu ; m = 0, -1, -2, \ldots,\ b(\nu) = 0\}$. They are ordered
from the bigger to the smaller by $-\lambda_1, -\lambda_2, -\lambda_3, \ldots$ ($\lambda_k > 0$ is a rational
number), and the multiplicity of $-\lambda_k$ is defined by $m_k$.

We define a function $I(t)$ from $\mathbb{R}$ to $\mathbb{R}$ by

$$I(t) = \int_{W_\epsilon} \delta(t - f(w))\, \varphi(w)\, dw, \qquad (2)$$

where $\epsilon > 0$ is taken as above. By definition, if $t > \epsilon$ or $t < 0$ then $I(t) = 0$.

Lemma 3 Assume that $\varphi(w)$ is a $C_0^\infty$-class function. Then $I(t)$ has an asymptotic
expansion for $t \to 0$. (The notation $\cong$ shows that the term can be asymptotically
expanded.)

$$I(t) \cong \sum_{k=1}^{\infty} \sum_{m=0}^{m_k - 1} c_{k,m+1}\, t^{\lambda_k - 1} (-\log t)^{m}, \qquad (3)$$

where $m! \cdot c_{k,m+1}$ is the coefficient of the $(m + 1)$-th order in the Laurent
expansion of $J(\lambda)$ at $\lambda = -\lambda_k$.

[Proof of Lemma 3] The special case of this lemma is shown in [?]. Let $I_K(t)$
be the restricted sum in $I(t)$ from $k = 1$ to $k = K$. It is sufficient to show that,
for an arbitrary fixed $K$,

$$\lim_{t \to 0} \left(I(t) - I_K(t)\right) t^{\lambda} = 0 \quad (\forall \lambda > -\lambda_{K+1} + 1). \qquad (4)$$

From the definition of $J(\lambda)$, $J(\lambda) = \int_0^1 I(t)\, t^{\lambda}\, dt$. A simple calculation shows

$$\int_0^1 t^{\lambda + \lambda_k - 1} (-\log t)^{m}\, dt = \frac{m!}{(\lambda + \lambda_k)^{m+1}}.$$

Therefore,

$$\int_0^1 \left(I(t) - I_K(t)\right) t^{\lambda}\, dt = J(\lambda) - \sum_{k=1}^{K} \sum_{m=0}^{m_k - 1} \frac{c_{k,m+1}\, m!}{(\lambda + \lambda_k)^{m+1}}.$$

By putting $t = e^{-x}$ and by using the inverse Laplace transform and the previous
Lemma 2, the complex integral path can be moved by regularity, which leads to
eq.(4). (Q.E.D.)
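
The elementary integral used in this proof can be verified numerically. The sketch below applies the same substitution $t = e^{-x}$, which turns it into $\int_0^\infty e^{-ax} x^m\, dx$ with $a = \lambda + \lambda_k$, and evaluates that with a midpoint rule. The truncation point, the step count, and the function name are assumptions chosen for illustration.

```python
import math

def log_power_integral(a, m, cutoff=60.0, steps=200_000):
    """Midpoint-rule value of int_0^1 t^(a-1) (-log t)^m dt for a > 0.

    After t = exp(-x) the integral becomes int_0^inf exp(-a x) x^m dx, which is
    truncated at x = cutoff / a, where exp(-a x) is negligibly small.
    """
    upper = cutoff / a
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += math.exp(-a * x) * x ** m
    return total * h

a, m = 1.5, 2
numeric = log_power_integral(a, m)
exact = math.factorial(m) / a ** (m + 1)
print(numeric, exact)
```

The agreement reflects the Gamma-function identity $\int_0^\infty e^{-ax} x^m\, dx = m!/a^{m+1}$.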

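[Proof of Theorem 1] By combining the above results, we have

$$
\begin{aligned}
F(n) &\leq -\log \int_{W} \exp(-n f(w))\, \varphi(w)\, dw \leq -\log \int_{W_\epsilon} \exp(-n f(w))\, \psi(w)\, dw \\
&= -\log \int_0^1 e^{-nt} I(t)\, dt = -\log \int_0^n e^{-t} I\left(\frac{t}{n}\right) \frac{dt}{n} \\
&\cong -\log \left\{ \sum_{k=1}^{\infty} \sum_{m=0}^{m_k - 1} \sum_{j=0}^{m} \frac{c_{k,m+1}\, ({}_m C_j)\, (\log n)^{j}}{n^{\lambda_k}} \int_0^n e^{-t}\, t^{\lambda_k - 1} (-\log t)^{m-j}\, dt \right\} \\
&= \lambda_1 \log n - (m_1 - 1) \log\log n + O(1),
\end{aligned}
$$

where $I(t)$ is defined by eq.(2) with $\psi(w)$ instead of $\varphi(w)$. (Q.E.D.)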



4 Proof of Theorem 2 



Hereafter, we assume that the model is given by

$$p(y|x, w) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(y - \psi(x, w))^2\right).$$

It is easy to generalize the result to a general standard deviation ($\sigma > 0$) case
and a general output dimension ($N > 1$) case. For this model,

$$f(w) = \frac{1}{2} \int \left(\psi(x, w) - \psi(x, w_0)\right)^2 q(x)\, dx,$$

$$f_n(w) = \frac{1}{2n} \sum_{i=1}^{n} \left(\psi(x_i, w) - \psi(x_i, w_0)\right)^2 - \frac{1}{n} \sum_{i=1}^{n} \eta_i \left(\psi(x_i, w) - \psi(x_i, w_0)\right),$$

where $\{\eta_i = y_i - \psi(x_i, w_0)\}$ are independent samples from the standard normal
distribution.



Lemma 4 Let $\{x_i, \eta_i\}_{i=1}^{n}$ be a set of independent samples taken from $q(x) q_0(y)$,
where $q(x)$ is a compact support and continuous probability density and $q_0(y)$ is
the standard normal distribution. Assume that the function $\xi(x, w)$ is analytic
for $w$ and continuous for $x$, and that the Taylor expansion of $\xi(x, w)$ around $\bar w$
absolutely converges in the region $T = \{w ; |w_j - \bar w_j| < r_j\}$. For a given constant
$0 < a < 1$, we define the region $T_a = \{w ; |w_j - \bar w_j| < a r_j\}$. Then, the following
hold.

(1) If $\int \xi(x, w)\, q(x)\, dx = 0$, there exists a constant $c'$ such that for an arbitrary
$n$,

$$A_n \equiv E_n\left\{ \sup_{w \in T_a} \left| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \xi(x_i, w) \right|^2 \right\} < c' < \infty.$$

(2) There exists a constant $c''$ such that for an arbitrary $n$,

$$B_n \equiv E_n\left\{ \sup_{w \in T_a} \left| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \eta_i\, \xi(x_i, w) \right|^2 \right\} < c'' < \infty.$$

[Proof of Lemma 4] We show (1). The statement (2) can be proven by the same
method. This lemma needs proof because $\sup_{w \in T_a}$ is inside the expectation. We
denote $k = (k_1, k_2, \ldots, k_d)$ and

$$\xi(x, w) = \sum_{k=0}^{\infty} a_k(x) (w - \bar w)^{k} = \sum_{k_1, \ldots, k_d = 0}^{\infty} a_{k_1 \cdots k_d}(x) (w_1 - \bar w_1)^{k_1} \cdots (w_d - \bar w_d)^{k_d}.$$

Let $K = \mathrm{supp}\, q(x)$ be a compact set. Since $\xi(x, w)$ is analytic for $w$, by Cauchy's
integral formula for several complex variables, there exists $\delta > 0$ such that

$$|a_k(x)| \leq \frac{M}{\prod_{j=1}^{d} (r_j - \delta)^{k_j}}, \qquad M = \max_{x \in K,\, w \in T_a} |\xi(x, w)|,$$

and that $\int a_k(x)\, q(x)\, dx = 0$. Thus

1 ^ r A/f 

= {/ \ak{x)\^q{x)dx}i < 

^ J llj Fj - 

Therefore, 



1 1 
^n=Sn{sup|^> ^{Xi 



ri oo 

^}5=S„{sup |^^^afe(a;*)(w- 



i —1 k—Q 



< 



LXJ ^ !L 

sup \^^^ak{xi){w < oo 
fe=0 i=l 



where $\delta$ is taken so that $a r_j < r_j - \delta$ $(j = 1, 2, \ldots, d)$. (Q.E.D.)

The function $\zeta_n(w)$ is defined as follows.

$$\zeta_n(w) = \frac{\sqrt{n}\, \left(f(w) - f_n(w)\right)}{\sqrt{f(w)}}.$$

Note that $\zeta_n(w)$ is a holomorphic function of $w$ except on $W_0$.



Theorem 4 Assume that $\psi(x, w)$ is analytic for $w \in \mathbb{R}^d$ and continuous for
$x \in \mathbb{R}^M$. Also assume that $q(x)$ is a compact support and continuous function.
Then, there exists a constant $C_3$ such that for arbitrary $n$

$$E_n\left\{ \sup_{w \in W \setminus W_0} |\zeta_n(w)|^2 \right\} < C_3,$$

where $W \subset \mathbb{R}^d$ is a compact set.

[Proof of Theorem 4] Outside of the neighborhood of $W_0$, this theorem can be
proven by the previous lemma. We assume that $W = W_\epsilon$. By compactness of $W$,
$W$ is covered by a union of finitely many small open sets. Thus we can assume $w \in W$
is in the neighborhood of $w_0 \in W$. By inductively applying Weierstrass'
preparation theorem [?] to the holomorphic function $\psi(x, w) - \psi(x, w_0)$, there
exists a finite set of functions $\{g_j, h_j\}_{j=1}^{J}$, where $g_j(w)$ is a holomorphic function
and $h_j(x, w)$ is a continuous function for $x$ and a holomorphic function for $w$,
such that

$$\psi(x, w) - \psi(x, w_0) = \sum_{j=1}^{J} g_j(w)\, h_j(x, w), \qquad (5)$$

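where the matrix

$$M_{jk} = \int h_j(x, w_0)\, h_k(x, w_0)\, q(x)\, dx$$

is positive definite. Let $a > 0$ be taken smaller than the minimum eigenvalue
of the matrix $M_{jk}$. By the definition,

$$f(w) = \frac{1}{2} \sum_{j,k=1}^{J} g_j(w)\, g_k(w) \int h_j(x, w)\, h_k(x, w)\, q(x)\, dx$$

is bounded by $f(w) \geq \frac{a}{2} \sum_{j=1}^{J} |g_j(w)|^2$, by taking small $\epsilon > 0$. We define

$$A(w) = \frac{1}{2} \sum_{j,k=1}^{J} g_j(w)\, g_k(w)\, \frac{1}{n} \sum_{i=1}^{n} a_{jk}(x_i, w),$$

$$a_{jk}(x_i, w) = \int h_j(x, w)\, h_k(x, w)\, q(x)\, dx - h_j(x_i, w)\, h_k(x_i, w),$$

$$B(w) = \sum_{j=1}^{J} g_j(w) \left\{ \frac{1}{n} \sum_{i=1}^{n} \eta_i\, h_j(x_i, w) \right\}.$$

Then $A(w) + B(w) = f(w) - f_n(w)$. By the Cauchy-Schwarz inequality,

$$E_n\left\{ \sup_{w \in W \setminus W_0} |\zeta_n(w)|^2 \right\} \leq E_n\left\{ \sup_{w \in W \setminus W_0} \frac{2n}{f(w)} \left(|A(w)|^2 + |B(w)|^2\right) \right\} \leq \mathrm{Const.}$$

For the last inequalities, we applied the previous Lemma 4. (Q.E.D.)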
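[Proof of Theorem 2] We define

$$ a_n = \sup_{w \in W \setminus W_0} |\zeta_n(w)|. $$

Then, by Theorem 4, $E_n\{a_n^2\} < \infty$. The free energy or the Bayesian
stochastic complexity satisfies

$$ F(n) = -E_n\Big\{ \log \int_W \exp(-n f_n(w))\, \varphi(w)\,dw \Big\} $$
$$ \phantom{F(n)} = -E_n\Big\{ \log \int_W \exp\big(-n f(w) - \sqrt{n f(w)}\, \zeta_n(w)\big)\, \varphi(w)\,dw \Big\} $$
$$ \phantom{F(n)} \ge -E_n\Big\{ \log \int_W \exp\big(-n f(w) + a_n \sqrt{n f(w)}\big)\, \varphi(w)\,dw \Big\}. $$

Let us define $Z_i(n)$ ($i = 1, 2$) by

$$ Z_i(n) = \int_{W(i)} \exp\big(-n f(w) + a_n \sqrt{n f(w)}\big)\, \varphi(w)\,dw, $$

where $W(1) = W_\epsilon$ and $W(2) = W \setminus W_\epsilon$. Then

$$ F(n) \ge -E_n\{\log(Z_1(n) + Z_2(n))\}
        = -E_n\{\log Z_1(n)\} - E_n\Big\{\log\Big(1 + \frac{Z_2(n)}{Z_1(n)}\Big)\Big\}. \tag{6} $$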

Let $F_1(n)$ and $F_2(n)$ be the first and the second terms of eq. (6), respectively.
For $F_1(n)$, the same procedure as for the upper bound can be applied,



C/c ,m+l ‘m Cj {log ny 



pn 

/ e-‘+“"^t^'=-i(-logt)™-^dt} ] 

Jo 



Fi{n) ^ -T;„[log{ 

k^m,j 

pn 

= Ai logn — (toi — 1) loglogn — if„{log / + 0(1) 

Jo 

= Ai log n — {mi — 1) log log n + 0(1) 

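where we used $a_n\sqrt{t} \le (1/2)(t + a_n^2)$. The term $F_2(n)$ is evaluated by using

$$ Z_1 \ge \int_{W_\epsilon} \exp(-n f(w))\, \varphi(w)\,dw \ge c_1 \frac{(\log n)^{m_1 - 1}}{n^{\lambda_1}} $$

and

$$ Z_2 \le \int_{W \setminus W_\epsilon} \exp\Big(-\frac{n f(w) - a_n^2}{2}\Big)\varphi(w)\,dw
       \le (1 - \varphi(W_\epsilon)) \exp\Big(-\frac{n\epsilon - a_n^2}{2}\Big); $$

we obtain $Z_2/Z_1 < \exp(a_n^2/2)$ for sufficiently large $n$. Hence

$$ F_2(n) \ge -E_n\Big\{\log\Big(1 + \exp\Big(\frac{a_n^2}{2}\Big)\Big)\Big\}
          \ge -E_n\Big\{\frac{a_n^2}{2}\Big\}
              - E_n\Big\{\log \frac{1 + \exp(a_n^2/2)}{\exp(a_n^2/2)}\Big\} > -\infty. $$

In the last inequality, we used Theorem 4. (Q.E.D.)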
5 Algorithm to Calculate the Learning Efficiency

The important values $\lambda_1$ and $m_1$ can be calculated by resolution of
singularities. Atiyah showed that the following theorem is directly proven from
Hironaka's theorem.

Theorem 5 (Hironaka) Let $f(w)$ be a real analytic function defined in a
neighborhood of $0 \in \mathbb{R}^d$. Then there exist an open set $U \ni 0$, a real
analytic manifold $U'$ and a proper analytic map $g : U' \to U$ such that

(1) $g : U' \setminus A' \to U \setminus A$ is an isomorphism, where $A = f^{-1}(0)$ and
$A' = g^{-1}(A)$,

(2) for each $P \in U'$ there are local analytic coordinates $(u_1, \ldots, u_d)$
centered at $P$ so that, locally near $P$, we have

$$ f(g(u_1, \ldots, u_d)) = h(u_1, \ldots, u_d)\, u_1^{k_1} u_2^{k_2} \cdots u_d^{k_d}, $$

where $h$ is an invertible analytic function and $k_i \ge 0$.

This theorem shows that the singularity of $f$ can be locally resolved. The
following is an algorithm to calculate $\lambda_1$ and $m_1$.

Algorithm to calculate the singular learning efficiency

(1) Cover the analytic variety $W_0 = \{w \in \mathrm{supp}\,\varphi ;\ f(w) = 0\}$ by a
finite union of open neighborhoods $U_\alpha$.

(2) For each neighborhood $U_\alpha$, find the analytic map $g$ by using blowing up.

(3) For each neighborhood, the function $J_\alpha(\lambda)$ is calculated:

$$ J_\alpha(\lambda) = \int_{U_\alpha} f(w)^\lambda\, \varphi(w)\,dw
   = \int_{g^{-1}(U_\alpha)} f(g(u))^\lambda\, \varphi(g(u))\, |g'|\,du
   = \int_{g^{-1}(U_\alpha)} h(u)^\lambda \prod_{i=1}^{d} u_i^{k_i \lambda}\, \varphi(g(u))\, |g'|\,du, $$

where $|g'|$ is the Jacobian. The last integration can be done for each variable
$u_i$, and the poles of $J_\alpha(\lambda)$ and their multiplicities are obtained.

(4) By $J(\lambda) = \sum_\alpha J_\alpha(\lambda)$, the poles and their multiplicities can
be calculated.

Example 1 (Regular Models). For the regular statistical models, by using the
appropriate coordinate $(w_1, \ldots, w_d)$ the average loss function $f(w)$ can be
locally written as

$$ f(w) = \sum_{i=1}^{d} w_i^2. $$

By blowing up the singularity, we find a map
$g : (u_1, \ldots, u_d) \mapsto (w_1, \ldots, w_d)$,

$$ w_1 = u_1, \qquad w_i = u_1 u_i \quad (2 \le i \le d). $$

Then the function $J(\lambda)$ is

$$ J(\lambda) = \int u_1^{2\lambda} \Big(1 + \sum_{i=2}^{d} u_i^2\Big)^{\lambda} |u_1|^{d-1}\,
   \varphi(u_1, u_1 u_2, u_1 u_3, \ldots, u_1 u_d)\, du_1\, du_2 \cdots du_d. $$

This function has the pole at $\lambda = -d/2$ with the multiplicity $m_1 = 1$.
Therefore, the free energy is

$$ F(n) = \frac{d}{2} \log n + O(1). $$

In this case, we can also calculate Sato's b-function.



Example 2. If the model

$$ p(y|x,a,b) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2}\big(y - a\tanh(bx)\big)^2\Big) $$

is trained using samples from $p(y|x,0,0)$, then

$$ f(a,b) = \frac{a^2}{2} \int \tanh(bx)^2\, q(x)\,dx. $$

In this case, the deepest singularity is the origin, and in the neighborhood of
the origin, $f(a,b) = a^2 b^2$. From this fact, it immediately follows that
$\lambda_1 = 1/2$ and $m_1 = 2$, so that

$$ F(n) \simeq \frac{1}{2}\log n - \log\log n + O(1). $$



Example 3. Let us consider a neural network

$$ p(y|x,a,b,c,d) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2}\big(y - \psi(x,a,b,c,d)\big)^2\Big), $$

$$ \psi(x,a,b,c,d) = a\tanh(bx) + c\tanh(dx). $$

Assume that the true regression function is $\psi(x,0,0,0,0)$. Then, the deepest
singularity of $f(a,b,c,d)$ is $(0,0,0,0)$ and, in the neighborhood of the origin,

$$ f(a,b,c,d) = (ab + cd)^2 + (ab^3 + cd^3)^2, $$

since the higher order term can be bounded by the above two terms.
By using blowing-up twice, we can find a map $g : (x,y,z,w) \mapsto (a,b,c,d)$,

$$ a = x, \qquad b = y^3 w - yz, \qquad c = zx, \qquad d = y. $$

By using this transform, we obtain

$$ f(g(x,y,z,w)) = x^2 y^6 \big[ w^2 + \{(y^2 w - z)^3 + z\}^2 \big], \qquad
   |g'(x,y,z,w)| = |x y^3|, $$

so that $\lambda_1 = 2/3$ and $m_1 = 1$, and $F(n) = (2/3)\log n + O(1)$.
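The blow-up identity of Example 3 can be checked mechanically. The following is a minimal sketch, not part of the original paper: the function names are hypothetical, the check evaluates both sides exactly at random rational points, and the helper `largest_monomial_pole` only reads off the poles contributed by the monomial factor $x^2 y^6$ together with the Jacobian $|x y^3|$ (it assumes the remaining bracket factor does not move the largest pole, as the example asserts).

```python
from fractions import Fraction
import random


def f(a, b, c, d):
    # Leading part of the average loss near the origin, as stated in Example 3.
    return (a * b + c * d) ** 2 + (a * b**3 + c * d**3) ** 2


def g(x, y, z, w):
    # Blow-up map (x, y, z, w) -> (a, b, c, d) stated in Example 3.
    return x, y**3 * w - y * z, z * x, y


def f_after_blow_up(x, y, z, w):
    # Factored form claimed for f(g(x, y, z, w)).
    return x**2 * y**6 * (w**2 + ((y**2 * w - z) ** 3 + z) ** 2)


def identity_holds(trials=200, seed=0):
    # Exact comparison at random rational points (hypothetical helper).
    rng = random.Random(seed)
    for _ in range(trials):
        point = [Fraction(rng.randint(-9, 9), rng.randint(1, 9)) for _ in range(4)]
        if f(*g(*point)) != f_after_blow_up(*point):
            return False
    return True


def largest_monomial_pole(loss_exponents, jacobian_exponents):
    # For an integrand prod_i |u_i|^(k_i * lam + h_i), integrating u_i near 0
    # produces a pole at lam = -(h_i + 1) / k_i; return the largest such pole.
    return max(
        Fraction(-(h + 1), k)
        for k, h in zip(loss_exponents, jacobian_exponents)
        if k > 0
    )


print(identity_holds())
print(largest_monomial_pole((2, 6), (1, 3)))
```

With exponents $(2, 6)$ for the loss and $(1, 3)$ for the Jacobian, the candidate poles are $-1$ and $-2/3$, which agrees with $\lambda_1 = 2/3$ in the example. The same helper applied to Example 1 with $d = 3$ (exponent 2, Jacobian exponent $d - 1 = 2$) gives $-3/2 = -d/2$.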



For the more general cases, some inequalities were obtained. It is shown
that, for all cases, $2\lambda_1 \le d$.
6 Conclusion

A mathematical foundation for singular learning machines such as neural networks
is established based on algebraic analysis. The free energy or the stochastic
complexity is asymptotically given by
$\lambda_1 \log n - (m_1 - 1)\log\log n + \mathrm{const.}$, where
$\lambda_1$ and $m_1$ are calculated by resolution of singularities.

Analysis for the maximum likelihood case or the zero temperature limit is an
important problem for the future. We expect that algebraic analysis also plays
an important role in such analysis.
Abstract. The statistical asymptotic theory is often used in theoretical
results in computational and statistical learning theory. It describes
the limiting distribution of the maximum likelihood estimator (MLE)
as a normal distribution. However, in layered models such as neural
networks, the regularity condition of the asymptotic theory is not
necessarily satisfied. The true parameter is not identifiable if the target
function can be realized by a network of smaller size than the size of the
model. Little has been known about the behavior of the MLE in these
cases of neural networks. In this paper, we analyze the expectation of the
generalization error of three-layer linear neural networks, and elucidate
a strange behavior in unidentifiable cases. We show that the expectation
of the generalization error in the unidentifiable cases is larger than what
is given by the usual asymptotic theory, and dependent on the rank of
the target function.



1 Introduction 

This paper discusses a non-regular property of multilayer network models, caused
by their structural characteristics. It is well known that learning in neural
networks can be described as parametric estimation from the viewpoint of
statistics. Under the assumption of Gaussian noise in the output, the least
square error estimator is equal to the maximum likelihood estimator (MLE), whose
statistical behavior is known in detail. Therefore, many researchers have
believed that the behavior of neural networks is perfectly described within the
framework of the well-known statistical theory, and have applied theoretical
methodologies to neural networks.

It has been clarified recently that the usual statistical asymptotic theory
on the MLE does not necessarily hold in neural networks ([1],[2]). This always
happens if we consider the model selection problem in neural networks. Assume
that we have a neural network model with H hidden units as a hypothesis
space, and that the target function can be realized by a network with a smaller
number of hidden units than H. In this case, as we explain in Section 2, the true
parameter in the hypothesis class, which realizes the target function, is
high-dimensional and not identifiable. The distribution of the MLE is not subject to


O. Watanabe, T. Yokomori (Eds.): ALT’99, LNAI 1720, pp. 51-|22I 1999- 
(c) Springer- Verlag Berlin Heidelberg 1999 






Kenji Fukumizu 



the ordinary asymptotic theory in this case. We cannot apply any methods such 
as AIC and MDL, which are based on the asymptotic theory. 

In this paper, we discuss the MLE of linear neural networks, as the simplest 
multilayer model. Also in this simple model, the true parameter loses identifia- 
bility if and only if the target is realized by a network with a smaller number of 
hidden units. As the first step to investigate the behavior of a learning machine 
in unidentifiable cases, we calculate the expectation of the generalization error 
of linear neural networks in asymptotic situations, and derive an approximate 
formula for large-scale networks. From these results, we see that the generaliza- 
tion error in unidentifiable cases is larger than what is derived from the usual 
asymptotic theory. While the ordinary asymptotic theory asserts that the ex- 
pectation of the generalization error depends only on the number of parameters, 
the generalization error in linear neural networks depends on the rank of target 
function. 



2 Neural Networks and Identifiability 

2.1 Neural Networks and Identifiability of Parameter 

A neural network model can be described as a parametric family of functions
$\{f(\cdot\,;\theta) : \mathbb{R}^L \to \mathbb{R}^M\}$, where $\theta$ is a parameter
vector. A three-layer neural network with $H$ hidden units is defined by

$$ f_i(x;\theta) = \sum_{j=1}^{H} w_{ij}\, \varphi\Big(\sum_{k=1}^{L} u_{jk} x_k + \zeta_j\Big) + \eta_i
   \qquad (1 \le i \le M), \tag{1} $$

where $\theta = (w_{ij}, \eta_i, u_{jk}, \zeta_j)$ summarizes all the parameters. The
function $\varphi(t)$ is called an activation function. In the case of a multilayer
perceptron, a bounded and non-decreasing function like $\tanh(t)$ is often used.

We consider regression problems, assuming that an output of the target system
is observed with noise. An observed sample $(x,y) \in \mathbb{R}^L \times \mathbb{R}^M$
satisfies

$$ y = f(x) + z, \tag{2} $$

where $f(x)$ is the target function, which is unknown to a learner, and
$z \sim N(0, \sigma^2 I_M)$ is a random vector representing noise, where
$N(\mu, \Sigma)$ is a normal distribution with mean $\mu$ and variance-covariance
matrix $\Sigma$. We use $I_M$ for the $M \times M$ unit matrix. An input vector $x$ is
generated randomly with its probability $q(x)dx$. A set of $N$ training data,
$\{(x^{(\nu)}, y^{(\nu)})\}_{\nu=1}^{N}$, is an independent sample from the joint
probability $p(y|x)q(x)dxdy$, where $p(y|x)$ is defined by eq. (2).

We discuss the maximum likelihood estimator (MLE), denoted by $\hat\theta$,
assuming the statistical model

$$ p(y|x;\theta) = \frac{1}{(2\pi\sigma^2)^{M/2}}
   \exp\Big(-\frac{1}{2\sigma^2}\,\|y - f(x;\theta)\|^2\Big), \tag{3} $$


Generalization Error of Linear Neural Networks in Unidentifiable Cases 







Fig. 1. Unidentifiable cases in neural networks 



which has the same noise model as the target. Under these assumptions, it is
easy to see that the MLE is equivalent to the least square error estimator, which
minimizes the following empirical error:

$$ E_{emp} = \sum_{\nu=1}^{N} \|y^{(\nu)} - f(x^{(\nu)};\theta)\|^2. \tag{4} $$

We evaluate the accuracy of the estimation using the expectation of the
generalization error:

$$ E_{gen} \equiv E_{\{(x^{(\nu)}, y^{(\nu)})\}}\Big[ \int \|f(x;\hat\theta) - f(x)\|^2\, q(x)\,dx \Big]. \tag{5} $$

We sometimes call it simply the generalization error if not confusing. It is easy
to see that the expected log likelihood is directly related to $E_{gen}$ as

$$ E_{\{(x^{(\nu)}, y^{(\nu)})\}}\Big[ \iint p(y|x)\, q(x)\, \big(-\log p(y|x;\hat\theta)\big)\,dy\,dx \Big]
   = \frac{1}{2\sigma^2} E_{gen} + \mathrm{const.} \tag{6} $$



Throughout this paper, we assume that the target function is realized by the
model; that is, there exists a true parameter $\theta_0$ such that
$f(x;\theta_0) = f(x)$. One of the special properties of neural networks is that,
if the target can be realized by a network with a smaller number of hidden units
than the model, the set of true parameters that realize the target function is
not a point but a union of high-dimensional manifolds (Fig. 1). Indeed, the
target can be realized in the parameter set where $w_{i1} = 0$ ($\forall i$) holds
and $u_{1k}$ takes an arbitrary value, and also in the set where $u_{1k} = 0$
($\forall k$) and $w_{i1}$ takes an arbitrary value, assuming $\varphi(0) = 0$. We say
that the true parameter is unidentifiable if the set of true parameters is a
union of manifolds whose dimensionality is more than one. The usual asymptotic
theory cannot be applied if the true parameter is unidentifiable. The Fisher
information matrix is singular in such a case ([3]). In the presence of noise in
the output data, the MLE is located a little apart from the high-dimensional set.
2.2 Linear Neural Networks

We focus on linear neural networks hereafter, as the simplest multilayer model.
A linear neural network (LNN) with $H$ hidden units is defined by

$$ f(x;A,B) = BAx, \tag{7} $$

where $A$ is an $H \times L$ matrix and $B$ is an $M \times H$ matrix. We assume that

$$ H \le M \le L $$

throughout this paper. Although $f(x;A,B)$ is just a linear map, the model is
not equal to the set of all linear maps from $\mathbb{R}^L$ to $\mathbb{R}^M$, but is
the set of linear maps of rank not greater than $H$. Thus, the model is not
equivalent to the linear regression model $\{Cx \mid C : M \times L \text{ matrix}\}$.
This model is known as reduced rank regression in statistics.

The parameterization in eq. (7) has trivial redundancy. The transform
$(A,B) \mapsto (GA, BG^{-1})$ does not change the map for any non-singular
$H \times H$ matrix $G$. Given a linear map of rank $H$, the set of parameters that
realize the map consists of an $H \times H$-dimensional manifold. However, we can
easily eliminate this redundancy if we restrict the parameterization so that the
first $H$ columns of $A$ make the unit matrix. Therefore, the essential number of
parameters is $H(L + M - H)$. In other words, we can regard $BA$ as a point in an
$H(L + M - H)$-dimensional space.
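The redundancy and the parameter count can be illustrated on a small case. The sketch below is only an illustration under assumed sizes ($L = 4$, $H = 2$, $M = 3$) and assumed matrix entries; the helper names are hypothetical. It checks with exact rational arithmetic that $(A, B)$ and $(GA, BG^{-1})$ define the same map $BA$, and counts $H(L + M - H)$ as the $HL + MH$ raw entries minus the $H^2$ degrees of freedom of $G$.

```python
from fractions import Fraction


def matmul(P, Q):
    # Plain matrix product on lists of rows.
    return [
        [sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
        for i in range(len(P))
    ]


def inverse_2x2(G):
    # Closed-form inverse of a non-singular 2 x 2 matrix, in exact arithmetic.
    (a, b), (c, d) = G
    det = Fraction(a * d - b * c)
    if det == 0:
        raise ValueError("G must be non-singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def essential_parameter_count(L, M, H):
    # H*L + M*H raw entries minus the H*H degrees of freedom of G.
    return H * (L + M - H)


# Illustrative values (assumptions, not from the paper).
A = [[1, 2, 0, -1], [0, 1, 3, 2]]   # H x L
B = [[1, 0], [2, 1], [-1, 4]]       # M x H
G = [[2, 1], [1, 1]]                # non-singular H x H

GA = matmul(G, A)
BG_inv = matmul(B, inverse_2x2(G))
same_map = matmul(BG_inv, GA) == matmul(B, A)
print(same_map, essential_parameter_count(4, 3, 2))
```

For the network used later in the simulations (50 input, 20 hidden, 30 output units), the same count gives $20 \times (50 + 30 - 20) = 1200$.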

More essential redundancy arises when the rank of a map is less than $H$.
Even if we use the above restriction, the set of parameters that realize such a
map is still high-dimensional. Thus, in LNN, the parameter $BA$ is identifiable if
and only if the rank of the target is equal to $H$. In this sense, we can regard
linear neural networks as a multilayer model, because it preserves the
unidentifiability explained in Section 2.1.

If the rank of the target is $H$, the usual asymptotic theory holds. It is well
known that $E_{gen}$ is given by

$$ E_{gen} = \frac{\sigma^2}{N}\, H(L + M - H) + O(N^{-3/2}), \tag{8} $$

using the number of parameters.

3 Generalization Error of Linear Neural Networks 

3.1 Exact Results 

It is known that the MLE of an LNN can be solved exactly. We introduce the
following notation:

$$ X = (x^{(1)}, \ldots, x^{(N)})^T, \quad Y = (y^{(1)}, \ldots, y^{(N)})^T, \quad
   \text{and} \quad Z = (z^{(1)}, \ldots, z^{(N)})^T. \tag{9} $$

Proposition 1. Let $V_H$ be an $M \times H$ matrix whose $i$-th column is the
eigenvector corresponding to the $i$-th largest eigenvalue of
$Y^T X (X^T X)^{-1} X^T Y$. Then, the MLE of a linear neural network is given by

$$ \hat{B}\hat{A} = V_H V_H^T\, Y^T X (X^T X)^{-1}. \tag{10} $$

Note that the MLE is unique even when the target is not identifiable, because
the statistical data include noise. It distributes along the set of true
parameters. The expectation of the generalization error is given by the
following theorem.

Theorem 1. Assume that the rank of the target is $r$ ($\le H$), and the
variance-covariance matrix of the input $x$ is positive definite. Then, the
expectation of the generalization error of a linear neural network is

$$ E_{gen} = \frac{\sigma^2}{N}\,\{\, r(L + M - r) + \phi(M - r,\, L - r,\, H - r) \,\} + O(N^{-3/2}), \tag{11} $$

where $\phi(p, n, q)$ is the expectation of the sum of the $q$ largest eigenvalues
of a random matrix following the Wishart distribution $W_p(n; I_p)$.

(The proof is given in the Appendix.)

The density function of the eigenvalues $\mu_1 \ge \cdots \ge \mu_p \ge 0$ of
$W_p(n; I_p)$ is known as

$$ \frac{1}{Z_n} \prod_{i=1}^{p} \mu_i^{(n-p-1)/2}
   \exp\Big(-\frac{1}{2}\sum_{i=1}^{p} \mu_i\Big) \prod_{i<j} (\mu_i - \mu_j), $$

where $Z_n$ is a normalizing constant. However, the explicit formula of
$\phi(p, n, q)$ is not known in general. In the following, we derive an exact
formula in a simple case and an approximation for large-scale networks.

We can exactly calculate $\phi(2, n, 1)$ as follows. Since the expectation of the
trace of a matrix from the distribution $W_2(n; I_2)$ is equal to $2n$, we have only
to calculate E[/ii — /i 2 ]. By transforming the variable as r = and oj = 

cos“^ we can derive 




1 



1 



p 



Tl 1 




( 12 ) 




w(2r cos u})‘^2r sin ujduidr 



(13) 



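Then, we obtain

$$ \phi(2, n, 1) = n + \sqrt{\pi}\, \frac{\Gamma\big(\frac{n+1}{2}\big)}{\Gamma\big(\frac{n}{2}\big)}. \tag{14} $$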

From this fact, we can calculate the expectation of the generalization error in 
the case H = M — 1 and r = H — 1. 
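The closed form for $\phi(2, n, 1)$ that is consistent with Theorem 2, namely $n + \sqrt{\pi}\,\Gamma((n+1)/2)/\Gamma(n/2)$, can be sanity-checked by simulation. The following is a rough Monte Carlo sketch, not the paper's derivation: it assumes that a $W_2(n; I_2)$ matrix is generated as a sum of $n$ outer products of standard normal vectors, and all function names, sample sizes, and seeds are illustrative choices.

```python
import math
import random


def phi_2_n_1_closed_form(n):
    # n + sqrt(pi) * Gamma((n + 1) / 2) / Gamma(n / 2), via log-gamma.
    ratio = math.exp(math.lgamma((n + 1) / 2) - math.lgamma(n / 2))
    return n + math.sqrt(math.pi) * ratio


def largest_eigenvalue_sample(n, rng):
    # Draw S = sum_k v_k v_k^T with v_k ~ N(0, I_2) and return its largest
    # eigenvalue using the closed form for a symmetric 2 x 2 matrix.
    s11 = s12 = s22 = 0.0
    for _ in range(n):
        a = rng.gauss(0.0, 1.0)
        b = rng.gauss(0.0, 1.0)
        s11 += a * a
        s12 += a * b
        s22 += b * b
    half_trace = 0.5 * (s11 + s22)
    radius = math.hypot(0.5 * (s11 - s22), s12)
    return half_trace + radius


def phi_2_n_1_monte_carlo(n, samples=20000, seed=0):
    # Sample mean of the largest eigenvalue.
    rng = random.Random(seed)
    return sum(largest_eigenvalue_sample(n, rng) for _ in range(samples)) / samples


for n in (2, 5):
    print(n, phi_2_n_1_closed_form(n), phi_2_n_1_monte_carlo(n))
```

For $n = 2$ the closed form reduces to $2 + \pi/2$, and the simulated mean should agree with it up to Monte Carlo error.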
Theorem 2. Assume $H = M - 1$. Then, the expectation of the generalization
error in the cases $r = H - 1$ and $r = H$ is given by

$$ E_{gen} = \begin{cases}
   \dfrac{\sigma^2}{N}\Big\{ (M-1)(L+1) - 1 + \sqrt{\pi}\,
      \dfrac{\Gamma\big(\frac{L-M+3}{2}\big)}{\Gamma\big(\frac{L-M+2}{2}\big)} \Big\} + O(N^{-3/2})
      & \text{if } r = H - 1 \text{ (unidentifiable)}, \\[2ex]
   \dfrac{\sigma^2}{N}\,(M-1)(L+1) + O(N^{-3/2})
      & \text{if } r = H \text{ (identifiable)}.
   \end{cases} \tag{15} $$

The interesting point is that the generalization error changes depending on
the identifiability of the true parameter. Since
$\sqrt{\pi}\,\Gamma\big(\frac{n+1}{2}\big)/\Gamma\big(\frac{n}{2}\big) > 1$ for $n \ge 2$,
$E_{gen}$ in the unidentifiable case is larger than $E_{gen}$ in the identifiable
case. If the number of input units is very large, from Stirling's formula, the
difference between these errors is approximated by
$\frac{\sigma^2}{N}\sqrt{\pi L/2}$, which reveals much worse generalization in
unidentifiable cases.
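To make the size of the gap concrete, the leading terms of Theorem 2 can be evaluated numerically. This is a minimal sketch under the assumption that the unidentifiable case uses $\phi(2, n, 1)$ with $n = L - M + 2$ (which follows from Theorem 1 with $H = M - 1$ and $r = H - 1$); the $O(N^{-3/2})$ remainder is ignored and the function names are hypothetical.

```python
import math


def gamma_ratio_term(n):
    # sqrt(pi) * Gamma((n + 1) / 2) / Gamma(n / 2), computed via log-gamma.
    return math.sqrt(math.pi) * math.exp(
        math.lgamma((n + 1) / 2) - math.lgamma(n / 2)
    )


def leading_generalization_error(L, M, r, sigma2, N):
    # Leading term of E_gen for H = M - 1 and r in {H - 1, H}.
    H = M - 1
    if r == H:
        return sigma2 / N * (M - 1) * (L + 1)
    if r == H - 1:
        n = L - M + 2
        return sigma2 / N * ((M - 1) * (L + 1) - 1 + gamma_ratio_term(n))
    raise ValueError("only r = H and r = H - 1 are covered")


# 2 input, 1 hidden, 2 output units: L = 2, M = 2, H = 1.
identifiable = leading_generalization_error(2, 2, 1, sigma2=1.0, N=1000)
unidentifiable = leading_generalization_error(2, 2, 0, sigma2=1.0, N=1000)
print(identifiable, unidentifiable)

# Stirling-type check: the gamma-ratio term approaches sqrt(pi * n / 2).
n = 1000
print(gamma_ratio_term(n) / math.sqrt(math.pi * n / 2))
```

In this small case the identifiable value is $3\sigma^2/N$ (three essential parameters), while the unidentifiable one is $(2 + \pi/2)\sigma^2/N$.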



3.2 Generalization Error of Large Scale Networks 

We analyze the generalization error of a large scale network in the limit when
$L$, $M$, and $H$ go to infinity in the same order. Let $S \sim W_p(n; I_p)$ be a
random matrix, and $\nu_1 \ge \nu_2 \ge \cdots \ge \nu_p \ge 0$ be the eigenvalues of
$n^{-1}S$. The empirical eigenvalue distribution of $n^{-1}S$ is defined by

$$ P_n = \frac{1}{p}\big( \delta(\nu_1) + \delta(\nu_2) + \cdots + \delta(\nu_p) \big), \tag{16} $$

where $\delta(\nu)$ is the Dirac measure at $\nu$. The strong limit of $P_n$ is given by

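Proposition 2. Let $0 < \alpha < 1$. If $n \to \infty$, $p \to \infty$ and
$p/n \to \alpha$, then $P_n$ converges almost everywhere to

$$ \rho_\alpha(u)\,du = \frac{1}{2\pi\alpha}\, \frac{\sqrt{(u - u_m)(u_M - u)}}{u}\, \chi(u)\,du, \tag{17} $$

where $u_m = (\sqrt{\alpha} - 1)^2$, $u_M = (\sqrt{\alpha} + 1)^2$, and $\chi(u)$ denotes
the characteristic function of $[u_m, u_M]$.

Figure 2 shows the graph of $\rho_\alpha(u)$ for $\alpha = 0.5$.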
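As a quick numerical sanity check of the limiting density, one can verify that it integrates to one and has mean one on its support. The sketch below assumes the form $\rho_\alpha(u) = \sqrt{(u - u_m)(u_M - u)}/(2\pi\alpha u)$ on $[u_m, u_M]$ and uses a plain midpoint rule; the function names and step count are illustrative.

```python
import math


def limiting_density(u, alpha):
    # Limiting eigenvalue density on [u_min, u_max]; zero outside the support.
    u_min = (math.sqrt(alpha) - 1.0) ** 2
    u_max = (math.sqrt(alpha) + 1.0) ** 2
    if u <= u_min or u >= u_max:
        return 0.0
    return math.sqrt((u - u_min) * (u_max - u)) / (2.0 * math.pi * alpha * u)


def midpoint_integral(fn, lo, hi, steps=20000):
    # Composite midpoint rule; avoids evaluating fn at the end points.
    h = (hi - lo) / steps
    return h * sum(fn(lo + (i + 0.5) * h) for i in range(steps))


alpha = 0.5
u_min = (math.sqrt(alpha) - 1.0) ** 2
u_max = (math.sqrt(alpha) + 1.0) ** 2
total_mass = midpoint_integral(lambda u: limiting_density(u, alpha), u_min, u_max)
mean = midpoint_integral(lambda u: u * limiting_density(u, alpha), u_min, u_max)
print(total_mass, mean)
```

A mean of one matches the fact that the eigenvalues of $n^{-1}S$ average to $\mathrm{Tr}(n^{-1}S)/p$, whose expectation is one.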

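We define $u_\beta$ as the $\beta$-percentile point of $\rho_\alpha(u)$; that is,
$\int_{u_\beta}^{u_M} \rho_\alpha(u)\,du = \beta$. If we transform the variable as
$t = (u - (1 + \alpha))/(2\sqrt{\alpha})$, the density of $t$ is

$$ r_\alpha(t) = \frac{2}{\pi}\, \frac{\sqrt{1 - t^2}}{2\sqrt{\alpha}\, t + 1 + \alpha}, \tag{18} $$

and the $\beta$-percentile point $t_\beta$ is given by
$\int_{t_\beta}^{1} r_\alpha(t)\,dt = \beta$. Then, we can calculate

$$ \lim_{\substack{n, p \to \infty \\ p/n \to \alpha}} \frac{1}{np}\, \phi(p, n, \beta p)
   = \int_{u_\beta}^{u_M} u\, \rho_\alpha(u)\,du
   = \frac{1}{\pi}\Big\{ \cos^{-1}(t_\beta) - t_\beta \sqrt{1 - t_\beta^2} \Big\}. \tag{19} $$

Combining this result with Theorem 1, we obtain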
Fig. 2. Density of eigenvalues $\rho_{0.5}(u)$



Theorem 3. Let $r$ ($r \le H$) be the rank of the target. Then, we have

$$ E_{gen} \sim \frac{\sigma^2}{N} \Big\{ r(L + M - r) + (L - r)(M - r)\, \frac{1}{\pi}
   \Big( \cos^{-1}(t_\beta) - t_\beta \sqrt{1 - t_\beta^2} \Big) \Big\} + O(N^{-3/2}), \tag{20} $$

when $L, M, H, r \to \infty$ with $\frac{M - r}{L - r} \to \alpha$ and
$\frac{H - r}{M - r} \to \beta$.

From elementary calculus, we can prove
$\frac{1}{\pi}\{\cos^{-1}(t_\beta) - t_\beta\sqrt{1 - t_\beta^2}\} > \beta(1 + \alpha(1 - \beta))$
for $0 < \alpha < 1$ and $0 < \beta < 1$. Therefore, we see that in unidentifiable
cases (i.e. $r < H$), $E_{gen}$ is greater than $\frac{\sigma^2}{N} H(L + M - H)$. Also
in these results, $E_{gen}$ depends on the rank of the target. This shows a clear
difference from the usual discussion on identifiable cases, in which $E_{gen}$ does
not depend even on the model but only on the number of parameters (eq. (8)).
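The large-scale approximation can be evaluated numerically once $t_\beta$ is known. The following is a sketch under stated assumptions: $\alpha = (M - r)/(L - r)$ and $\beta = (H - r)/(M - r)$ are plugged in at finite sizes, $t_\beta$ is found by bisection on the upper-tail mass of the density $\frac{2}{\pi}\sqrt{1 - t^2}/(2\sqrt{\alpha}\,t + 1 + \alpha)$, and the $O(N^{-3/2})$ term is dropped. All function names, step counts, and iteration counts are illustrative.

```python
import math


def t_density(t, alpha):
    # Density of the rescaled eigenvalue variable t on (-1, 1).
    return (2.0 / math.pi) * math.sqrt(1.0 - t * t) / (
        2.0 * math.sqrt(alpha) * t + 1.0 + alpha
    )


def upper_tail_mass(t0, alpha, steps=2000):
    # Midpoint-rule approximation of the mass on [t0, 1].
    if t0 >= 1.0:
        return 0.0
    h = (1.0 - t0) / steps
    return h * sum(t_density(t0 + (i + 0.5) * h, alpha) for i in range(steps))


def t_beta(alpha, beta, iterations=50):
    # Bisection for the point whose upper-tail mass equals beta.
    if beta <= 0.0:
        return 1.0
    if beta >= 1.0:
        return -1.0
    lo, hi = -1.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if upper_tail_mass(mid, alpha) > beta:
            lo = mid  # too much mass above mid: move right
        else:
            hi = mid
    return 0.5 * (lo + hi)


def tail_mean(t):
    # (1 / pi) * (arccos(t) - t * sqrt(1 - t^2)).
    return (math.acos(t) - t * math.sqrt(1.0 - t * t)) / math.pi


def large_scale_generalization_error(L, M, H, r, sigma2, N):
    # Leading term of the large-scale approximation (remainder ignored).
    if r == H:
        return sigma2 / N * H * (L + M - H)
    alpha = (M - r) / (L - r)
    beta = (H - r) / (M - r)
    spread = (L - r) * (M - r) * tail_mean(t_beta(alpha, beta))
    return sigma2 / N * (r * (L + M - r) + spread)


# Sizes of the first simulation: 50 input, 20 hidden, 30 output units.
value_rank0 = tail_mean(t_beta(0.6, 2.0 / 3.0))
print(value_rank0, (2.0 / 3.0) * (1 + 0.6 * (1 - 2.0 / 3.0)))
print(large_scale_generalization_error(50, 30, 20, 0, 1.0, 10000))
print(large_scale_generalization_error(50, 30, 20, 20, 1.0, 10000))
```

For rank 0 the first printed value should exceed $\beta(1 + \alpha(1 - \beta)) = 0.8$, so the approximate error exceeds the identifiable value $\frac{\sigma^2}{N} \cdot 1200$.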



3.3 Numerical Simulations 

First, we make experiments using LNN with 50 input, 20 hidden, and 30 output 
units. We prepare target functions of rank from 0 to 20. The generalization error 
of the MLE with respect to 10000 training data is calculated. The left graph 
of Fig. 3 shows the average of the generalization errors over 100 data sets and
the theoretical results given in Theorem 0 We see that the experimental and 
theoretical results coincide very much. 

Next, we investigate the generalization error in an almost unidentifiable case, 
where the true parameter is identifiable but has very small singular values. We 
prepare a LNN with 2 input, 1 hidden, and 2 output units. The target function 
is f{x;9o) = where e is a small positive number. The target is 

identifiable for a non-zero $\varepsilon$. The right graph of Fig. 3 shows the average of
the generalization errors for 1000 training data. Surprisingly, while using 1000 
training data for only 3 parameters, the result shows that the generalization 
errors for small e are much larger than what is given by the usual asymptotic 
theory. They are rather close to Egen of the unidentifiable case marked by x . 
Fig. 3. Experimental results



4 Conclusion 

This paper discussed the behavior of the MLE in unidentifiable cases of multi- 
layer neural networks. The ordinary methods based on the asymptotic theory 
cannot be applied to neural networks, if the target is realized by a smaller num- 
ber of hidden units than the model. As the first step to clarifying the correct 
behavior of multilayer models, we elucidated the theoretical expression of the 
expectation of the generalization error for linear neural networks, and derived 
an approximate formula for large scale networks. From these results, we see that 
the generalization error in unidentifiable cases is larger than the generalization 
error in identifiable cases, and dependent on the rank of the target. This shows 
clear difference from ordinary models which always give a unique true parameter. 

Appendix 

A Proof of Theorem 1 

Let $C_0 = B_0 A_0$ be the coefficient of the target function, and $\Sigma$ be the
variance-covariance matrix of the input vector $x$. From the assumption of the
theorem, $\Sigma$ is positive definite. The expectation of the generalization error
is given by

$$ E_{gen} = E_{X,Y}\big[ \mathrm{Tr}[ (\hat{B}\hat{A} - C_0)\, \Sigma\, (\hat{B}\hat{A} - C_0)^T ] \big]. \tag{21} $$

We define an $M \times L$ random matrix $W$ by

$$ W = Z^T X (X^T X)^{-1/2}. \tag{22} $$

Note that all the elements of $W$ are subject to the normal distribution
$N(0, \sigma^2)$, mutually independent, and independent of $X$. From Proposition 1
we have

$$ \hat{B}\hat{A} - C_0 = (V_H V_H^T - I_M)\, C_0 + V_H V_H^T\, W (X^T X)^{-1/2}. \tag{23} $$

This leads to the following decomposition:

$$ E_{gen} = E_{X,W}\big[ \mathrm{Tr}[ C_0 \Sigma C_0^T (I_M - V_H V_H^T) ] \big]
   + E_{X,W}\big[ \mathrm{Tr}[ V_H V_H^T\, W (X^T X)^{-1/2}\, \Sigma\, (X^T X)^{-1/2} W^T ] \big]. \tag{24} $$

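We expand $(X^T X)^{1/2}$ and $X^T X$ as

$$ (X^T X)^{1/2} = \sqrt{N}\, \Sigma^{1/2} + F, \qquad X^T X = N\Sigma + \sqrt{N}\, K. \tag{25} $$

Then, the matrices $F$ and $K$ are of the order $O(1)$ as $N$ goes to infinity. We
write $\varepsilon = 1/\sqrt{N}$ hereafter for notational simplicity, and obtain the
expansion of $\frac{1}{N} Y^T X (X^T X)^{-1} X^T Y$ as

$$ T(\varepsilon) = \frac{1}{N}\, Y^T X (X^T X)^{-1} X^T Y
   = T^{(0)} + \varepsilon T^{(1)} + \varepsilon^2 T^{(2)}, \tag{26} $$

where

$$ T^{(0)} = C_0 \Sigma C_0^T, $$
$$ T^{(1)} = C_0 K C_0^T + C_0 \Sigma^{1/2} W^T + W \Sigma^{1/2} C_0^T, $$
$$ T^{(2)} = W W^T + W F C_0^T + C_0 F W^T. \tag{27} $$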

Since the column vectors of $V_H$ are the eigenvectors of $T(\varepsilon)$, they are
obtained by the perturbation of the eigenvectors of $T^{(0)}$. Following the method
of Kato (Section II), we will calculate the projection $P_j(\varepsilon)$ onto the
eigenspace corresponding to the eigenvalue $\lambda_j(\varepsilon)$ of $T(\varepsilon)$.
We call $P_j(\varepsilon)$ an eigenprojection.

Let $\lambda_1 > \ldots > \lambda_r > 0$ be the positive eigenvalues of
$T^{(0)} = C_0 \Sigma C_0^T$, $P_i$ ($1 \le i \le r$) be the corresponding
eigenprojections, and $P_0$ be the eigenprojection corresponding to the
eigenvalue 0 of $T^{(0)}$. Then, from the singular value decomposition of
$C_0 \Sigma^{1/2}$, we see that there exist projections $Q_i$ ($1 \le i \le r$) of
IR^ such that their images are mutually orthogonal 1-dimensional subspaces and 
the equalities 



P,CoE^/^ = KQ, (28) 



hold for all $i$. We define the total projection $Q$ by

$$ Q = \sum_{i=1}^{r} Q_i. \tag{29} $$

First, let $\lambda_i(\varepsilon)$ ($1 \le i \le r$) be the eigenvalue obtained by the
perturbation of $\lambda_i$, and $P_i(\varepsilon)$ be the eigenprojection corresponding
to $\lambda_i(\varepsilon)$. Clearly, the equality

$$ P_i(\varepsilon) = P_i + O(\varepsilon) \tag{30} $$

holds.
Next, we consider the perturbation of $P_0$. Generally, by the perturbation of
eq.® , the eigenvalue 0 of splits into several eigenvalues. Since the pertur- 
bation is caused by a positive definite random matrix, these eigenvalues are pos- 
itive and different from each other almost surely. Let Ar+i(£) > • ■ • > Xm{£) > 0 
be the eigenvalues, and Pr+j{s) be the corresponding eigenprojections. We define 
the total projection of the eigenvalues Xr+j by 

Poie)=E^=7Pr+M- (31) 

The non-zero eigenvalues of T{e)Po{s) are Ar+j(e) (1 < j < M — r). To obtain 
the expansion of Pr+j{s), we expand T{e)Po{e) as 

T{e)Po{e) = (32) 

Then, from Kato (0,(2.2O)), we see that the coefficient matrices of en. (i;t2ll are 
in general given by 

f (b = PoT^^^Po, 

f (2) = PqT^^'>Po - PqT^^'i PqT^^'i S - PqT^^'> ST^P P o - SpPPoTPPo, 
f (3) = -PqTP PqTP S - PoPPPoPPs - PoPPSpPPo - PoPPSpPPo 

- SpPPoPPPo - SpPPoPPPo + PoPPPoPPSpPS 

+ PoPPSpPPoPPS + PoPPSpPSpPPo + SpPPoPPPoPPS 
+ SpPPoPPSpPPo + SpPSpPPoPPPo - PqPP PqPP PqPP 

- PoPP PoS'^pP Po - PoPP S^PP PoPP Po - S'^pPPoPPPoPPPo, 

(33) 

where S is defined by 

which is the inverse of pP in the image oi I — Pq. Note that from eqs (ESJ, dZH]), 
and the equality 

= Q (35) 

holds. 

From the fact pPPo = 0, we have 

GoPo = 0. (36) 

Using eq. and eq. we can derive 

pP = 0, pP = Po{WW^ - WQW^)Po. (37) 

In particular, Pr+j (e) is the eigenprojection of 

\p{s)Po{e) = pP + ePP + s^pP + ■■■ . 



(38) 
In the leading term, $P_0 W (I_L - Q)$ is the orthogonal projection of $W$ onto an
$(M - r)$-dimensional subspace in the range and onto an $(L - r)$-dimensional
subspace in the domain, respectively. Thus, the distribution of the random matrix
is equal to the Wishart distribution $W_{M-r}(L - r;\, \sigma^2 I_{M-r})$.

We expand Pr+j{e) as 



Pr+ii.s) — Pr+j + sPr+j + P^+j + (39) 

From Kato (□, (2.14)), the coefficients are in general given by 



(3)5, - (3)p.+„ 

P^(2)^. = - (3)5, - 5,f(3)p^+j + P,.+,T'(3)5,f ( 3 ) 5 , + 5,f (3)p^+jf ( 3 ) 5 , 

I C.'f'(3) 0 . 71 ( 3 ) p_p .T'(3) p .pis) c2 _ p p(3) c2 p 

j. ^ J ^ P ~\~3 ^ ^ P ~\~3 ^ P~\~ 3 '^ ^ 3 ^ p ~^3 

- S]f^^'>Pr+jf^^'>Pr+j, (40) 

where 5, is defined by 



Sj=- 



E 

l<k<M—r 



1 



11 j - Ik 



-Pr+j {I — Po)- 

V] 



(41) 



Here r]i > ... > r]M-r are the non-zero eigenvalues of r(^) . The matrix 5, is 
equal to the inverse of r(3) — 77 , /m in the image of / — Pr+j. 

Using the expansions obtained above, we will calculate the generalization 
error. The first term of eq. m can be rewritten as 

ET=H+i-r^xMnCoSCj Pr+,{e)]]. (42) 

Having Pr+jCg = 0 in mind, we obtain 

Tr[CoUCo^PE] = 0’ 

Tr[CoUC(f Pj:%] = (/ - Po)f^^^ Pr+jf^^Kl ~ Po)]- (43) 

^j 

Using ea. lTTHIl . ea. lTTTIl . and the fact that CqSCq{I — Pq)S = I — Pq, we can 
derive from eq. (1^ 



Tr[CoEC^ Pj:‘^^j] = Tr[{T<^^'> PoT^"^^ - PqT<^^'> 

P^+,-(r(2)PoT(3) - r(3)5T(3)PoT(i))5]. (44) 



Furthermore, eq. (E3 leads 

pP'fPoT^^'fPr+j - T(3)PoT(3)5T(3)p^+, = CoUilU^Po(lUIU^ - WQW'^)Pr+j 

= 7JjCoS^W^Pr+j. (45) 
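Finally, we obtain

$$ \mathrm{Tr}[ C_0 \Sigma C_0^T P^{(2)}_{r+j} ]
   = \mathrm{Tr}[ C_0 \Sigma^{1/2} W^T P_{r+j} W \Sigma^{1/2} C_0^T S ]
   = \mathrm{Tr}[ P_{r+j} W Q W^T ]. \tag{46} $$

The random matrices $P_{r+j}$ and $W Q W^T$ are independent, because $WQ$ and
$W(I_L - Q)$ are independent. Therefore,

$$ E_{X,W}\big[ \mathrm{Tr}[ C_0 \Sigma C_0^T (I_M - V_H V_H^T) ] \big]
   = \varepsilon^2 \sum_{j=H-r+1}^{M-r} E_{X,W}\big[ \mathrm{Tr}[ P_{r+j} W Q W^T ] \big]
   = \sigma^2 \varepsilon^2\, r(M - H) + O(\varepsilon^3) \tag{47} $$

is obtained.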

Using eqs. (E3 and the second term of eq. (E3) is rewritten as 



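$$ E_{X,W}\big[ \mathrm{Tr}[ V_H V_H^T\, W (X^T X)^{-1/2}\, \Sigma\, (X^T X)^{-1/2} W^T ] \big]
   = \varepsilon^2 E_{X,W}\Big[ \sum_{i=1}^{r} \mathrm{Tr}[ P_i W W^T ]
     + \sum_{j=1}^{H-r} \mathrm{Tr}[ P_{r+j} W W^T ] \Big]. \tag{48} $$

Because $\sum_{i=1}^{r} P_i$ is a non-random orthogonal projection onto an
$r$-dimensional subspace and the distribution of each element of $W$ is
$N(0, \sigma^2)$, we have

$$ E_{X,W}\Big[ \mathrm{Tr}\Big[ \sum_{i=1}^{r} P_i W W^T \Big] \Big] = \sigma^2 r L. \tag{49} $$

We can calculate the second part of the right hand side of eq. (48) as

$$ \mathrm{Tr}[ P_{r+j} W W^T ] = \mathrm{Tr}[ P_{r+j} W Q W^T ] + \mathrm{Tr}[ P_{r+j} (W W^T - W Q W^T) ]
   = \mathrm{Tr}[ P_{r+j} W Q W^T ] + \eta_j. \tag{50} $$

Because $\eta_j$ is the $j$-th largest eigenvalue of a random matrix from the Wishart
distribution $W_{M-r}(L - r;\, \sigma^2 I_{M-r})$, we obtain

$$ E_{X,W}\Big[ \sum_{j=1}^{H-r} \mathrm{Tr}[ P_{r+j} W W^T ] \Big]
   = \sigma^2 \{ r(H - r) + \phi(M - r,\, L - r,\, H - r) \}. \tag{51} $$

From eqs. (47), (49), and (51), we prove the theorem. □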
Abstract. The neuroidal tabula rasa (NTR), a hypothetical device
capable of performing tasks related to cognitive processes in the
brain, was introduced by L.G. Valiant in 1994. Neuroidal nets represent a
computational model of the NTR. Their basic computational element is
a kind of programmable neuron called a neuroid. Essentially it is a
combination of a standard threshold element with a mechanism that allows
modification of the neuroid's computational behaviour. This is done by
changing its state and the settings of its weights and of its threshold in the
course of computation. The computational power of an NTR crucially
depends both on the functional properties of the underlying update
mechanism that allows changing of neuroidal parameters and on the universe
of allowable weights. We will define instances of neuroids for which the
computational power of the respective finite-size NTR ranges from that
of finite automata, through Turing machines, up to that of a certain
restricted type of BSS machines that possess super-Turing computational
power. The latter two results are surprising since similar results were
known to hold only for certain kinds of analog neural networks.



1 Introduction 

Nowadays, we are witnessing a steadily increasing interest towards understand- 
ing the algorithmic principles of cognition. The respective branch of computer 
science has been recently appropriately named as cognitive computing. This no- 
tion, coined by L.G. Valiant 0, denotes any computation whose computational 
mechanism is based on our ideas about brain computational mechanisms and 
whose goal is to model cognitive abilities of living organisms. There is no sur- 
prise that most of the corresponding computational models are based on formal 
models of neural nets. 

Numerous variants of neural nets have been proposed and studied. They differ 
in the computational properties of their basic building elements, viz. neurons. 
Usually, two basic kinds of neurons are distinguished: discrete ones that compute 

* This research was supported by GA CR Grant No. 201/98/0717 



O. Watanabe, T. Yokomori (Eds.): ALT’99, LNAI 1720, pp. 63-[2| 1999. 
(c) Springer- Verlag Berlin Heidelberg 1999 






Jiří Wiedermann



with Boolean values, and analog (or continuous) ones that compute with any real 
or rational number between 0 and 1. 

As far as the computational power of the respective neural nets is concerned, 
it is known that the finite nets consisting of discrete neurons are computationally 
equivalent to finite automata (cf. 0). On the other hand, finite nets of analog 
neurons with rational weights, computing in discrete steps with rational values, 
are computationally equivalent to Turing machines (cf. P|)- If weights and com- 
putations with real values are allowed then the respective analog nets possess 
even super-Turing computational abilities. No types of finite discrete neural
nets are known that would be more powerful than the finite automata. 

An important aspect of all interesting cognitive computations is learning. 
Neural nets learn by adjusting the weights on neural interconnections according 
to a certain learning algorithm. This algorithm and the corresponding mechanism 
of weight adjustment are not considered as part of the network. 

Inspired by real biological neurons, Valiant suggested in 1988 a special
kind of programmable discrete neurons, called neuroids, in order to make the 
learning mechanism a part of neural nets. Based on its current state and current 
excitation from firings of the neighboring neuroids, a neuroid can change in the 
next step all its computational parameters (i.e., can change its state, threshold, 
and weights) . In his monograph P Valiant introduced the notion of a neuroidal 
tabula rasa (NTR). It is a hypothetical device which is capable of performing 
tasks related to cognitive processes. Neuroidal nets serve as a computational 
model of the NTR. Valiant described a number of neuroidal learning algorithms 
demonstrating a viability of neuroidal nets to model the NTR. Nevertheless, 
insufficient attention has been paid to the computational power of the respective 
nets. Without pursuing this idea any further Valiant merely mentioned that 
the computational power of neuroids depends on the restriction put upon their 
possibilities to self-modify their computational parameters. 

It is clear that by identifying a computational power of any learning device 
we get an upper qualitative limit on its learning or cognitive abilities. Depending 
on this limit, we can make conclusions concerning the efficiency of the device at 
hand and those related to its appropriateness to serve as a realistic model of its 
real, biological counterpart. 

In this paper we will study the computational power of the neuroidal tabula 
rasa which is represented by neuroidal nets. The computational limits will be 
studied w.r.t the various restrictions on the update abilities of neuroidal com- 
putational parameters. 

In Section 2 we will describe a broad class of neuroidal networks as introduced 
by Valiant in f]. 

Next, in Section 3, three restricted classes of neuroidal nets will be intro- 
duced. They will include nets with a finite, infinite countable (i.e, integer), and 
uncountable (i.e., real) universe of weights, respectively. 

Section 4 will briefly sketch the equivalence of the most restricted version of
finite neuroidal nets, namely those with a finite set of parameters, with
finite automata.

In Section 5 we further show the computational equivalence of the latter
neuroidal nets with the standard neural nets.

The next variant of neuroidal nets, viz. those with integer weights, will be
considered in Section 6. We will prove that finite neuroidal nets with weights
of size $S(n)$, which allow simple arithmetic over their weights (i.e., adding or
subtracting the weights), are computationally equivalent to
$S(n)$-space-bounded Turing machines.

In Section 7 we will increase the computational power of the previously con- 
sidered model of neuroidal nets by allowing their weights to be real numbers. 
The resulting model will turn out to be computationally equivalent to the so-called
additive BSS machine (PJ). This machine model is known for its ability to solve 
some undecidable problems. 

Finally, in the conclusions we will discuss the merits of the results presented. 

2 Neuroidal Nets 

In what follows we will define neuroidal nets making use of Valiant's original
proposal [7], essentially including his notation.

Definition 2.1 A neuroidal net $\mathcal{N}$ is a quintuple $\mathcal{N} = (G, W, X, \delta, \lambda)$, where

• $G = (V, E)$ is the directed graph describing the topology of the network; $V$
is a finite set of $N$ nodes called neuroids, labeled by distinct integers $1, 2, \ldots, N$,
and $E$ is a set of $M$ directed edges between the nodes. The edge $(i, j)$ for $i, j \in
\{1, \ldots, N\}$ is an edge directed from node $i$ to node $j$.

• $W$ is the set of numbers called weights. To each edge $(i, j) \in E$ there is a
value $w_{ij} \in W$ assigned at each instant of time.

• $X$ is the finite set of modes which a neuroid can be in at each
instant. Each mode is specified as a pair $(q, p)$ of values, where $q$ is a member
of a finite set $Q$ of states and $p$ is an integer from a finite set $T$ called the set
of thresholds of the neuroid.

$Q$ consists of two kinds of states, called firing and quiescent states.

To each node $i$ there is also a Boolean variable $f_i$ having value one or zero
depending on whether the node $i$ is in a firing state or not.

• $\delta$ is the recursive mode update function of the form $\delta : X \times W \to X$.

Let $w_i \in W$ be the sum of those weights $w_{ki}$ of neuroid $i$ that are on
edges $(k, i)$ coming from neuroids which are currently firing, i.e., formally

$$w_i = \sum_{\substack{k \text{ firing} \\ (k,i) \in E}} w_{ki} = \sum_{(j,i) \in E} f_j w_{ji}.$$

The value of $w_i$ is called the excitation of $i$ at that time.

The mode update function $\delta$ defines for each combination $(s_i, w_i)$ holding at
time $t$ the mode $s_i' \in X$ that neuroid $i$ will transit to at time $t+1$: $\delta(s_i, w_i) = s_i'$.

• $\lambda$ is the recursive weight update function of the form $\lambda : X \times W \times W \times \{0, 1\} \to
W$. It defines for each weight $w_{ji}$ at time $t$ the weight $w_{ji}'$ to which it will transit
at time $t+1$, where the new weight can depend on the values of each of $s_i$, $w_i$, $w_{ji}$,
and $f_j$ at time $t$: $\lambda(s_i, w_i, w_{ji}, f_j) = w_{ji}'$.

The elements of sets $Q$, $T$, $W$, and the $f_i$'s are called parameters of net $\mathcal{N}$.

A configuration of $\mathcal{N}$ at time $t$ is a list of modes of all neuroids followed by a
list of weights of all edges in $\mathcal{N}$ at that time. The respective lists of parameters
are pertinent to neuroids ordered by their labels and to edges ordered lexicographically
w.r.t. the pair of labels (of neuroids) that identify the edge at hand.
Thus at any time a configuration is an element of $X^N \times W^M$.

The computation of a neuroidal network is determined by the initial condi- 
tions and the input sequence. The initial conditions specify the initial values of 
weights and modes of the neuroids. These are represented by the initial config- 
uration. The input sequence is an infinite sequence of inputs or stimuli which 
specifies for each t = 0,1,2,... a set of neuroids along with the states into which 
these neuroids are forced to enter (and hence forced to fire or prevented from 
firing) at that time by mechanisms outside the net (by peripherals). 

Formally, each stimulus is an $N$-tuple from the set $(Q \cup \{*\})^N$. If there is a
symbol $q$ at the $i$-th position in the $t$-th $N$-tuple $s_t$, then this denotes the fact that
the neuroid $i$ is forced to enter state $q$ at time $t$. The special symbol $*$ is used as
a don't-care symbol at positions which are not influenced by peripherals at that
time.

A computational step of neuroidal net $\mathcal{N}$, which finds itself in a configuration
$c_t$ and receives its input $s_t$ at time $t$, is performed as follows. First, neuroids are
forced to enter the states dictated by the current stimuli. Neuroids not influenced
by peripherals at that time retain their original state as in configuration
$c_t$. In this way a new configuration $c_t'$ is obtained. The excitation $w_i$ is then computed for
this configuration, and the mode and weight updates are realized for each
neuroid $i$ in parallel, in accordance with the respective functions $\delta$ and $\lambda$. In this
way a new configuration $c_{t+1}$ is entered.
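
The computational step just described can be sketched in code. The following Python fragment is only an illustration under stated assumptions: the function name `step`, the 0-based neuroid labels, the use of `None` for the don't-care symbol, and the representation of modes as `(state, threshold)` pairs are all hypothetical choices that do not appear in the original definition.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Mode = Tuple[str, int]   # assumed encoding of a mode: (state q, threshold p)
Edge = Tuple[int, int]   # (j, i): edge directed from neuroid j to neuroid i


def step(modes: List[Mode],
         weights: Dict[Edge, int],
         stimulus: List[Optional[str]],
         firing_states: Set[str],
         delta: Callable[[Mode, int], Mode],
         lam: Callable[[Mode, int, int, int], int]):
    """One synchronous step; returns (new_modes, new_weights, action)."""
    # 1. Peripherals force states; None plays the role of the symbol *.
    forced = [(s if s is not None else q, p)
              for (q, p), s in zip(modes, stimulus)]
    # 2. f[j] = 1 iff neuroid j is in a firing state.
    f = [1 if q in firing_states else 0 for q, _ in forced]
    # 3. Excitation w_i = sum of f_j * w_ji over all edges (j, i).
    excitation = [0] * len(modes)
    for (j, i), w in weights.items():
        excitation[i] += f[j] * w
    # 4. Mode and weight updates, all computed from the time-t values.
    new_modes = [delta(forced[i], excitation[i]) for i in range(len(modes))]
    new_weights = {(j, i): lam(forced[i], excitation[i], w, f[j])
                   for (j, i), w in weights.items()}
    # The action is the tuple of states after the step.
    action = [q for q, _ in new_modes]
    return new_modes, new_weights, action
```

With a hypothetical `delta` that fires exactly when the excitation reaches the threshold and a `lam` that leaves weights unchanged, the sketch behaves like an ordinary threshold network; richer update functions give the programmable behaviour discussed in the text.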

The result of the computation after the $t$-th step is the $N$-tuple of states of
all neuroids in $c_{t+1}$. This $N$-tuple is called the action at time $t$. Obviously, any
action is an element of $Q^N$. Then the next computational step can begin.

The output of the whole computation can be seen as an infinite sequence of 
actions. 

From the computational point of view any neuroidal net can be seen as a 
transducer which reads an infinite sequence of inputs (stimuli) and produces an 
infinite sequence of outputs (actions). 

For more details about the model see [7].



3 Variants of Neuroidal Nets 

In the previous definition of neuroidal nets we allowed set W to be any set of 
numbers and the weight and mode update functions to be arbitrary recursive 
functions. Intuitively it is clear that by restricting these conditions we will get
variants of neuroidal nets differing in their expressiveness as well as in their
computing power. In his monograph [7], Valiant discusses this problem and suggests
two extreme possibilities.






The first one considers such neuroidal nets where the set of weights of individual
neuroids is finite. This is called a “simple complexity-theoretic model”
in Valiant’s terminology. We will also call the respective neuroid
a “finite weight” neuroid. Note that in this case functions $\delta$ and $\lambda$ can both be
described by finite tables.

The next possibility we will study are neuroidal nets where the universe of
allowable weights and thresholds is represented by the infinite set of all integers.
In this case it is no longer possible to describe the weight update function by a
finite table. What we need instead is a simple recursive function that will allow
efficient weight modifications. Therefore we will consider a weight update function
which allows setting a weight to some constant value, adding or subtracting the
weights, and assigning existing weights to other input edges. Such a weight
update function will be called a simple arithmetic update function. The respective
neuroid will be called an “integer weight” neuroid. The size of each weight will
be given by the number of bits needed to specify the respective weight value.
This is essentially the model that Valiant considers as the counterpart of the
previous model.

The final variant of neuroidal nets which we will investigate is the variant of 
the previously mentioned model with real weights. The resulting model will be 
called an additive real neuroidal net. 

4 Finite Weight Neuroidal Nets and Finite Automata 

It is obvious that in the case of neuroidal nets with finite weights there is but
a finite number of different configurations a single neuroid can enter. Hence its
computational activities, like those of any finite neuroidal net, can be described
by a single finite automaton (or more precisely, by a finite transducer). In order
to get some insight into the relation between the sizes of the respective devices
we will describe the construction of the respective transducer in more detail in
the next theorem. In fact this transducer will be a Moore machine (i.e., the type
of a finite automaton producing an output after each transition) since there is
an output (action) produced by $\mathcal{N}$ after each computational move.

Theorem 4.1 Let $\mathcal{N}$ be a finite neuroidal net consisting of $N$ neuroids with a
finite set of weights. Then there is a constant $c > 0$ and a finite Moore automaton
$\mathcal{A}$ of size $O(c^N)$ that simulates $\mathcal{N}$.

Sketch of the proof: We will describe the construction of the Moore automaton
$\mathcal{A} = (I, S, q_0, O, \Delta)$. Here $I$ denotes the input alphabet, whose elements are $N$-tuples
of the stimuli: $I = (Q \cup \{*\})^N$. Set $S$ is a set of states consisting of all configurations
of $\mathcal{N}$, i.e., $S = X^N \times W^M$. State $q_0$ is the initial state and it is equal to the initial
configuration of $\mathcal{N}$. Set $O$ denotes the set of outputs of $\mathcal{A}$. It consists of all possible
actions of $\mathcal{N}$: $O = Q^N$.

The transition function $\Delta : I \times S \to S \times O$ is defined as follows: $\Delta(i, s_1) = (s_2, o)$ if
and only if the net $\mathcal{N}$ in configuration $s_1$ and with input $i$ will enter configuration
$s_2$ and produce output $o$ in one computational move. It is clear that the input-output
behaviour of $\mathcal{N}$ and $\mathcal{A}$ is the same. □

Note that the size of the automaton is exponential w.r.t. the size of the neuroidal
net. In some cases such a size explosion seems to be unavoidable. For instance,
a neuroidal net consisting of $N$ neuroids can implement a binary counter
that can count up to $c^N$, where $c > 2$ is a constant which depends on the number
of states of the respective neuroids. The equivalent finite automaton would then
require at least $\Omega(c^N)$ states. Thus the advantage of using neuroidal nets instead
of finite automata seems to lie in the description economy of the former devices.

The reverse simulation of a finite automaton by a finite neuroidal net is 
trivial. In fact, a single neuroid, with a single input, is enough. During the 
simulation, this neuroid transits to the same states as the simulated automaton 
would. There is no need for a neuroid to make use of its threshold mechanism. 

5 Simulating Neuroidal Nets by Neural Nets 

Neural nets are a restricted kind of neuroidal networks in which the neuroids can 
modify neither their weights nor their thresholds. The respective set of neuroidal 
states consists of only two states — of a firing and quiescent state. Moreover, 
the neurons are forced to fire if and only if the excitation reaches the threshold 
value. The computational behaviour of neural networks is defined similarly as 
that of the neuroidal ones. 

It has been observed by several authors that neural nets are also computa- 
tionally equivalent to the finite automata (cf. 0). Thus, we get the following 
consequence of the previous theorem: 

Corollary 5.1 The computational power of neuroidal nets with a finite set of 
weights is equivalent to that of standard non-programmable neural nets. 

In order to better appreciate the relationship between the sizes of the respec- 
tive neuroidal and neural nets, we will investigate the direct simulation of finite 
neuroidal nets with finite weights by finite neural nets. 

Theorem 5.1 Let $\mathcal{N} = (G, W, X, \delta, \lambda)$ be a finite neuroidal net consisting of $N$
neuroids and $M$ edges. Let the set of weights of $\mathcal{N}$ be finite. Let $|L|$ and $|D|$,
respectively, be the number of all different sets of arguments of the corresponding
weight and mode update function. Let $S$ be the set of all possible excitation
values, $|S| \le 2^{|W|}$.

Then $\mathcal{N}$ can be simulated by a neural network $\mathcal{N}'$ consisting of $O((|X| + |S| +
|L| + |D|)N + |W|M)$ neurons.

Proof: It is enough to show that for any neuroid $i$ of $\mathcal{N}$ an equivalent neural network $C_i$
can be constructed. At any time the neuroid $i$ is described by its “instantaneous
description”, viz. its mode and the corresponding set of weights. The idea of the simulation
is to construct a neural net for all combinations of parameters that represent a possible
instantaneous description of $i$. The instantaneous value of each parameter will be
represented by a special module. There will also be two extra modules to realize the mode
and weight update functions. Instead of changing the parameters, the simulating neural
net will merely “switch” among the appropriate values representing the parameters of
the instantaneous description of the simulated neuroid. The details are as follows.

Fig. 1. Data flow in module $C_i$ simulating a single neuroid

$C_i$ will consist of five different modules.

First, there are three modules called mode module, excitation module, and weight 
module. The purpose of each of these modules is to represent a set of possible values of 
the respective quantity that represents, in order of their above enumeration, a possible 
mode of a neuroid, a possible value of the total excitation coming from the firings of 
adjacent neurons, and possible weights for all incoming edges. 

Thus the mode module $M_i$ consists of $|X|$ neurons. For each pair of the form $(q, p) \in X$,
with $q \in Q$ and $p \in T$, there is a corresponding neuron in $M_i$. Moreover, the neuron
corresponding to the current mode of neuroid $i$ is firing, while all the remaining neurons
in $M_i$ are in a quiescent state.

The weight module $W_i$ consists of a two-dimensional array of neurons. To each
incoming edge to $i$ there is a row of $|W|$ neurons. Each row contains neurons
corresponding to each possible value from the set $W$. If $(j, i) \in E$ is an incoming edge to $i$
carrying the weight $w_{ji} \in W$, then the corresponding neuron in the corresponding row
of $W_i$ is firing.

The excitation module $E_i$ consists of $O(|S|)$ neurons. Among them, at each time
only one neuron is firing, namely the one that corresponds to the current excitation
$w_i$ of $i$. Let

$$w_i = \sum_{\substack{k \text{ firing} \\ (k,i) \in E}} w_{ki} = \sum_{(j,i) \in E} f_j w_{ji}$$

at that time. In order to compute $w_i$, we have to add only those weights that occur at the
connections from currently firing neuroids. Therefore we shall first check all pairs of the
form $(f_j, w_{ji})$ to see which weight value $w_{ji}$ should participate in the computation of the
total excitation. This will be done by dedicating special neurons $t_{ijk}$ to this task, with $k$
ranging over all weights in $W$. Each neuron $t_{ijk}$ will receive 2 inputs. The first one comes
from the neuron in the $j$-th row and $k$-th column of the weight module, which corresponds to
some weight $w \in W$. This connection will carry weight $w$. The other connection will come
from $C_j$ and will carry the weight 1. Neuron $t_{ijk}$ will fire iff $j$ is firing and the current weight
of connection $(j, i)$ is equal to $w$. In other words, $t_{ijk}$ will fire iff its excitation equals
exactly $w + 1$. This calls for implementing an equality test which requires the presence
of some additional neurons, but we will skip the respective details. The outcomes from
all $t_{ijk}$ are then again summed and tested for equality against all possible excitation
values. In this way the current value of $w_i$ is eventually determined, and the respective
neurons serve as output neurons of the excitation module.

Besides these three modules there are two more modules that represent and realize
the transition functions $\delta$ and $\lambda$, respectively.

The $\delta$-module contains one neuron for each set of arguments of the mode update
function $\delta$. Neuron $d$, responsible for the realization of the transition of the form $\delta(s_i, w_i) =
s_i'$, has its threshold equal to 2. Its incoming edges from each output neuron in $M_i$ and
from each output neuron in $E_i$ carry the weight 1. Clearly, $d$ fires iff the neurons
corresponding to both quantities $s_i$ and $w_i$ fire. Firing of $d$ will subsequently inhibit the
firing of the neuron corresponding to $s_i$ and excite the firing of the neuron corresponding
to $s_i'$. Moreover, if the state corresponding to $s_i'$ is a firing state of $i$, then also a special
neuron out in $C_i$ is made to fire.

The $\lambda$-module is constructed in a similar way. It also contains one neuron for each
set of arguments of the weight update function $\lambda$. The neuron $\ell$ responsible for the
realization of the transition of the form $\lambda(s_i, w_i, w_{ji}, f_j) = w_{ji}'$ has the threshold 4. Its
incoming edges of weight 1 connect it to each output neuron in $M_i$, to each output
neuron in $E_i$, to each neuron from the row corresponding to the $j$-th incoming edge
of $i$ in $W_i$, and to the output from $C_j$. Clearly, $\ell$ fires iff the neurons corresponding to all
four quantities $s_i$, $w_i$, $w_{ji}$, and $f_j$ fire. Firing of $\ell$ will subsequently inhibit the firing
of the neuron corresponding to $w_{ji}$ in $W_i$ and excite the firing of the neuron corresponding
to $w_{ji}'$, also in $W_i$.

Schematically, the topology of network $C_i$ is sketched in Fig. 1. For simplicity,
only the flow of data is depicted by arrows.

The size of $C_i$ is given by the sum of the sizes of all its modules. The whole net
$\mathcal{N}'$ thus contains $N$ mode-, excitation-, $\delta$- and $\lambda$-modules, of size $|X|$, $|S|$, $|D|$, and
$|L|$, respectively. Moreover, for each of the $M$ edges of $\mathcal{N}$ there is a complete row of $|W|$
neurons. Altogether this leads to the size estimate given in the statement of the
theorem. □
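
To make the size estimate concrete, the following Python sketch simply evaluates the expression $(|X| + |S| + |L| + |D|)N + |W|M$ for given parameters. It is an illustration only: the function name and argument names are hypothetical, and the constant hidden in the $O$-notation (e.g., the auxiliary neurons for the equality tests) is ignored.

```python
def simulating_net_size(n_neuroids: int, n_edges: int,
                        n_modes: int, n_excitations: int,
                        n_lambda_args: int, n_delta_args: int,
                        n_weights: int) -> int:
    """Evaluate (|X| + |S| + |L| + |D|) * N + |W| * M.

    Constants hidden in the O-notation are deliberately ignored.
    """
    per_neuroid = n_modes + n_excitations + n_lambda_args + n_delta_args
    return per_neuroid * n_neuroids + n_weights * n_edges


# Hypothetical example: N = 10 neuroids, M = 30 edges,
# |X| = 4, |S| = 8, |L| = 64, |D| = 32, |W| = 3.
example = simulating_net_size(10, 30, 4, 8, 64, 32, 3)
```

For these hypothetical parameters the expression evaluates to 1170, which grows only linearly in $N$ and $M$.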

From the previous theorem we can see that the size of a simulating neural 
network is larger than that of the original neuroidal network. It is linear in both 
the number of neuroids and edges of the neuroidal network. The constant of 
proportionality depends linearly on the size of “program” of individual neuroids, 
and exponentially on the size of the universe of weights. However, note that the 
neural net constructed in the latter theorem is much smaller than that obtained 
via the direct simulation of the finite automaton corresponding to the simulated 
neuroidal net. A neural net simulating the automaton from the proof of Theorem 4.1
would be of size $O(c^N)$ for some constant $c > 0$.

To summarize the respective results, we see that when comparing finite neuroidal
nets to standard, non-programmable neural nets, the programmability of
the former does not increase their computational power; it merely contributes
to their greater descriptive expressiveness.

6 Integer Weight Neuroidal Nets and Turing Machines 

Now we will show that in the case of integer weights there exist neuroidal nets
of finite size that can simulate any Turing machine.

Since we will be interested in space-bounded machines w.l.o.g. we will first 
consider a single-tape Turing machine in place of a simulated machine. In order 
to extend our results also for sublinear space complexities we will later also 
consider single-tape machines with separate input tapes. 

First we show that even a single neuroid is enough for simulation of a single 
tape Turing machine. 

Theorem 6.1 Any single tape Turing machine¹ of time complexity $T(n)$ and of
space complexity $S(n)$ can be simulated in time $O(T(n) S^2(n))$ by a single neuroid
making use of integer weights of size $O(S(n))$ and of a simple arithmetic weight
update function.

Sketch of the proof: Since we are dealing with space-bounded Turing machines
(TMs), w.l.o.g. we can consider only single-tape machines. Thus in what follows we
will describe the simulation of one computational step of a single-tape Turing machine $\mathcal{M}$
of space complexity $S(n)$ with tape alphabet $\{0, 1\}$. It is known (cf. [2]) that the tape
of such a machine can be replaced by two stacks, $S_L$ and $S_R$, respectively. The first
stack holds the contents of $\mathcal{M}$'s tape to the left of the current head position, while
the second stack represents the rest of the tape. The left and the right end of the tape,
respectively, are at the bottoms of the respective stacks. Thus we assume
that $\mathcal{M}$'s head always scans the top of the right stack. For technical reasons we will
add an extra symbol 1 to the bottom of each stack. During its computation $\mathcal{M}$ merely
updates the top of these stacks, pushes symbols onto them, or pops symbols from
them.

With the help of a neuroid $n$ we will represent machine $\mathcal{M}$ in a configuration
described by the contents of its two stacks and by the state of the machine’s finite
state control in the following way. The contents of both stacks will be represented by
two integers $v_L$ and $v_R$, respectively. Note that both $v_L, v_R \ge 1$ thanks to the 1's at the
bottoms of the respective stacks. The instantaneous state of $\mathcal{M}$ is stored in the states
of $n$.

To simulate $\mathcal{M}$ we merely have to manipulate the above mentioned two stacks in
a way that corresponds to the actions of the simulated machine. Thus, the net has to
be able to read the top element of a stack, to delete it (to pop the stack), and to add
(to push) a new element onto the top of the stack. W.r.t. our representation of a stack
by an integer $v$, say, reading the top of a stack asks for determining the parity of $v$.
Popping an element from or pushing it onto a stack means computing $\lfloor v/2 \rfloor$ and $2v$
(plus the pushed symbol), respectively. All this must be done with the help of additions and subtractions.
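
The integer encoding of a stack can be illustrated by a short Python sketch. It is only an illustration under assumptions: the function names are hypothetical, and ordinary `%` and `//` are used here for readability, whereas the construction in the proof restricts itself to additions and subtractions.

```python
EMPTY = 1  # a stack holding only the extra bottom symbol 1


def push(v: int, bit: int) -> int:
    """Push a binary symbol: double v (by addition) and add the symbol."""
    return v + v + bit


def top(v: int) -> int:
    """The top symbol is the parity of v."""
    return v % 2


def pop(v: int) -> int:
    """Popping drops the last binary digit of v."""
    return v // 2


def move_right(v_left: int, v_right: int):
    """Head moves right: the scanned symbol goes onto the left stack."""
    return push(v_left, top(v_right)), pop(v_right)


# Tape contents 1 0 1 with the head on the first symbol: the right stack is
# built by pushing the symbols from right to left.
v_r = push(push(push(EMPTY, 1), 0), 1)
v_l = EMPTY
v_l, v_r = move_right(v_l, v_r)
```

After the move in this example, the left stack holds the first symbol and the top of the right stack is the second symbol of the tape.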

The idea of the respective algorithm that computes the parity of any $v > 0$ is
as follows. From $v$ we will successively subtract the largest possible power of 2 not
greater than $v$, until the value of $v$ drops to either 0 or 1. Clearly, in the former case,
the original value of $v$ was even, while in the latter case it was odd.

¹ Note that in the case of single tape machines the input size is counted into the space
complexity and therefore we have $S(n) \ge n$.

More formally, the resulting algorithm looks as follows:

while v > 1 do p1 := 1; p2 := 2;
    while p2 ≤ v do p1 := p1 + p1; p2 := p2 + p2 od;
    v := v − p1
od;
if v = 1 then return(odd) else return(even) fi;

By a similar algorithm we can also compute the value of $\lfloor v/2 \rfloor$ (i.e., the value of $v$
shifted by one position to the right, thus losing its rightmost digit). This value is equal
to the sum of the halves of the respective powers (as long as they are greater than 1)
computed in the course of the previous algorithm.

The time complexity of both algorithms is $O(S^2(n))$.
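
Both algorithms can be combined into one runnable Python sketch that uses only addition, subtraction and comparison. The function name and the auxiliary variable `h` (which remembers the previous value of `p1`, i.e. half of the current power) are assumptions made for this illustration and are not part of the original description.

```python
def parity_and_half(v: int):
    """Return (v mod 2, floor(v / 2)) using only +, - and comparisons."""
    half = 0
    while v > 1:
        h, p1, p2 = 0, 1, 2
        # Double until p1 is the largest power of 2 not greater than v.
        while p2 <= v:
            h, p1, p2 = p1, p1 + p1, p2 + p2
        # h is half of p1 when p1 > 1, and 0 when p1 = 1.
        half = half + h
        v = v - p1
    # v is now 0 (even) or 1 (odd).
    return v, half
```

Each pass of the outer loop removes the leading binary digit of `v`, and the inner loop runs at most once per remaining digit, which matches the quadratic bound in the number of digits stated above.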

The “neuroidal” implementation of the previous algorithms looks as follows. The
algorithms will have to make use of the values representing both stacks, $v_L$ and $v_R$,
respectively. Furthermore, they will need access to the auxiliary values $p_1$, $p_2$, $v$, and
to the constant 1. All the previously mentioned values will be “stored” as weights of
$n$. For technical reasons imposed by functionality restrictions of neuroids, which will
become clearer later, we will also need to store the inverse (i.e., negated) values of all
previously mentioned variables. These will also be stored in the weights of $n$.

Hence, the neuroid $n$ will have 12 inputs. These inputs are connected to $n$ via
12 connections. Making use of the previously introduced notation, the first six will hold
the weights $w_1 = v_L$, $w_2 = v_R$, $w_3 = v$, $w_4 = p_1$, $w_5 = p_2$, and $w_6 = 1$. The remaining
six will carry the same values with the opposite sign.

The output of n is connected to all 12 inputs. 

The neuroid simulates each move of $\mathcal{M}$ in a series of steps. Each series performs one
run of the previously mentioned (or of a similar) algorithm and therefore consists of
$O(S^2(n))$ steps.

At the beginning of the series that will simulate the $(t + 1)$-st move of $\mathcal{M}$, the
following invariant is preserved by $n$ for any $t \ge 0$. Weights $w_1$ and $w_2$ represent the
contents of the stacks after the $t$-th move, and $w_7 = -w_1$ and $w_8 = -w_2$. The remaining
weights are set to zero.

At the beginning of the computation, the left stack is empty and the right stack
contains the input word of $\mathcal{M}$. We will assume that $n$ will accept its input by entering a
designated state. Also, the threshold of $n$ will be set to 0 all the time.

Assume that at time $t$ the finite control of $\mathcal{M}$ is in state $q$. Until its change, this
state is stored in all forthcoming neuroidal states that $n$ will enter.

In order to read the symbol from the top of the right stack the neuroid has to
determine the last binary digit of $w_2$ or, in other words, it has to determine the parity
of $w_2$. To do so, we first perform all the necessary initialization assignments to auxiliary
variables, and to their “counterparts” holding the negative values. In order to perform
the necessary tests (comparisons), the neuroid must enter a firing state. Due to the
neuroidal computational mechanism and thanks to the connection between the output
of $n$ and all its inputs, all its non-zero weights will participate in the subsequent
comparison of the total excitation against $n$’s threshold. It is here that we will make
proper use of weights with the opposite sign: the weights (i.e., variables) that should
not be compared, and should not be forgotten, will participate in a comparison with
opposite signs and thus cancel out.

For instance, to perform the comparison $p_2 \le v$, we merely “switch off” the positive
value of $p_2$ and the negative value of $v$ from the comparison by temporarily setting
the respective weights $w_5$ and $w_9$ to zero. All the other weight values remain as they
were. As a result, after the firing step, $n$ will compare $v - p_2$ against its threshold value
(which is permanently set to 0) and will enter a state corresponding to the result of this
comparison. After the comparison, the weights set temporarily to zero can be restored
to their previous values (by assignments $w_5 := -w_{11}$ and $w_9 := -w_3$).
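
The cancellation trick can be checked with a small Python sketch. It assumes, as an illustration only, that the neuroid fires iff its total excitation reaches the threshold, and it uses hypothetical helper names and a 1-based weight list (index 0 is unused padding) so that the indices agree with $w_1, \ldots, w_{12}$ in the text.

```python
def make_weights(v_l: int, v_r: int, v: int, p1: int, p2: int):
    """Weights w[1..6] = v_L, v_R, v, p1, p2, 1 and w[7..12] = their negations."""
    positive = [v_l, v_r, v, p1, p2, 1]
    return [0] + positive + [-x for x in positive]  # w[0] is unused padding


def p2_le_v(w, threshold: int = 0) -> bool:
    """Test p2 <= v by one firing step, then restore the switched-off weights."""
    w[5], w[9] = 0, 0                # switch off +p2 and -v
    excitation = sum(w[1:])          # all 12 inputs fire: everything else cancels
    fired = excitation >= threshold  # excitation equals v - p2
    w[5], w[9] = -w[11], -w[3]       # restore from the opposite-sign copies
    return fired
```

Because every other weight is paired with its negation, the excitation reduces to $v - p_2$, and the restoring assignments leave the weight vector exactly as it was before the test.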

The transition of $\mathcal{M}$ into a state as dictated by its transition function is realized
by $\mathcal{N}$ after updating the stacks appropriately, by storing the respective machine state
into the state of $n$. The simulation ends by entering into the final state.

It is clear that the simulation runs in time as stated in the theorem. The size of 
any stack, and hence of any variable, never exceeds the value S(n) + 1. Hence the size 
of the weights of n will be bounded by the same value. □ 

Note that a similar construction, still using only one neuroid, would also
work in case a multiple tape Turing machine should be simulated. In order to
simulate a $k$-tape machine, the resulting neuroid will represent each tape by 12
weights as it did before. This will lead to a neuroid with $12k$ incoming edges.

Next we will also show that a simulation of an off-line Turing machine by a 
finite neuroidal network with unbounded weights is possible. This will enable us 
to prove a similar theorem as before which holds for arbitrary space complexities: 



Theorem 6.2 Let $\mathcal{M}$ be an off-line multiple tape Turing machine of space
complexity $S(n) \ge 0$. Then $\mathcal{M}$ can be simulated in cubic time by a finite neuroidal
net that makes use of integer weights of size $O(S(n))$ and of a simple arithmetic
weight update function.

Sketch of the proof: In order to read the respective inputs the neuroidal net will
be equipped with the same input tape as the simulated Turing machine. Besides the
neuroid $n$ that takes care of a proper update of the stacks that represent the respective
machine tapes, the simulating net will also contain two extra neuroids that implement
the control mechanism of the input head movement. For each move direction (left or
right) there will be a special, so-called move neuroid, which will fire if and only
if the input head has to move in the respective direction. The symbol read by the input
head will represent an additional input to neuroid $n$ simulating the moves of $\mathcal{M}$.

The information about the move direction will be inferred by neuroid $n$. As can be
seen from the description of the simulation in the previous theorem, $n$ keeps track of
the particular transition of $\mathcal{M}$ that should be realized during each series of its steps
simulating one move of $\mathcal{M}$.

Since $n$ can transmit this information to the respective move neuroids only via firing
and cannot distinguish between the two target neuroids, we will have to implement a
(finite) counter in each move neuroid. The counter will count the number of firings of $n$
occurring in an uninterrupted sequence. Thus at the end of the last step of the simulation
of $\mathcal{M}$'s move (see the proof of the previous theorem) $n$ will send two successive firings to
denote the left move and three firings for the right move. The respective signals will
reach both move neuroids, but with the help of counting they will find out which of them
is in charge of moving the head. Some care over synchronization of all three neuroids
must be taken. □
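
The signalling convention between $n$ and the two move neuroids can be sketched as a decoder of firing bursts. The following Python fragment is a hypothetical illustration: the function name is invented, and the decision to ignore bursts of any length other than two or three is an assumption, since the proof sketch does not say how other firings of $n$ are treated.

```python
def decode_moves(firings):
    """Decode a 0/1 firing sequence: a run of two 1s means 'L', three means 'R'."""
    moves, run = [], 0
    # A trailing 0 is appended so that a burst at the very end is also closed.
    for f in list(firings) + [0]:
        if f:
            run += 1
        else:
            if run == 2:
                moves.append("L")   # the left-move neuroid would fire
            elif run == 3:
                moves.append("R")   # the right-move neuroid would fire
            run = 0                 # other run lengths are ignored (assumption)
    return moves
```

In the net itself each move neuroid would keep this counter in its finite set of states and react only to its own burst length.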

It is clear that the computations of finite neuroidal nets with integer weights
can be simulated by Turing machines. Therefore the computational power of
both devices is the same.

In 1995, Siegelmann and Sontag 0 proved that the computational power
of certain analog neural nets is equivalent to that of Turing machines. They
considered finite neural nets with fixed rational weights. At time $t$, the output
of their analog neuron $i$ is a value between 0 and 1 which is determined by
applying a so-called piecewise linear activation function $\phi$ to the excitation $w_i$
of $i$ at that time (see Definition 2.1): $\phi : w_i \to [0, 1]$. For negative excitation,
$\phi$ takes the value 0, for excitation greater than 1 the value 1, and for excitations
between 0 and 1, $\phi(w_i) = w_i$. The respective net computes synchronously, in
discrete time steps.
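
The activation function and one synchronous step of such a net can be sketched as follows. This Python fragment is an illustration only: the names `phi` and `analog_step`, the matrix layout (`weights[i][j]` is the weight of the connection from neuron `j` to neuron `i`), and the use of exact rationals via `fractions.Fraction` are assumptions made here, not details taken from the cited construction.

```python
from fractions import Fraction
from typing import List


def phi(w: Fraction) -> Fraction:
    """Piecewise linear activation: 0 below 0, 1 above 1, identity in between."""
    if w < 0:
        return Fraction(0)
    if w > 1:
        return Fraction(1)
    return w


def analog_step(outputs: List[Fraction],
                weights: List[List[Fraction]]) -> List[Fraction]:
    """One synchronous step: every neuron applies phi to its excitation."""
    n = len(outputs)
    return [phi(sum((weights[i][j] * outputs[j] for j in range(n)), Fraction(0)))
            for i in range(n)]
```

Unlike in a neuroidal net, the weights stay fixed here; all information carried from step to step lives in the rational outputs of the neurons.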

We will call the respective nets as synchronous analog neural nets. 

Siegelmann and Sontag’s analog neural network simulating a universal Turing
machine consisted of 883 neurons. This can be compared with the simple
construction from Theorem 6.1 requiring but a single neuroid. Nevertheless, the
equivalence of both types of networks with Turing machines proves the following
corollary:

Corollary 6.1 Finite synchronous analog neural nets are computationally equiv- 
alent to finite neuroidal nets with integer weights. 

7 Real Weight Neuroidal Nets and the Additive BSS 
Model 

Now we will characterize the computational power of neuroidal nets with real 
parameters. We will compare their efficiency towards a restricted variant of the 
BSS model. The BSS model (cf. P) is a model that is similar to RAM which 
computes with real numbers under the unit cost model. In doing so, all four ba- 
sic arithmetic operations of additions, subtractions, multiplication and division 
are allowed. The additive BSS model allows only for the former two arithmetic 
operations. 

Theorem 7.1 The additive real model of neuroidal nets is computationally equiv- 
alent to the additive BSS model working over binary inputs. 

Sketch of the proof: The simulation of a finite additive real neuroidal net
$\mathcal{N}$ on the additive BSS model $\mathcal{B}$ is a straightforward matter.

For the reverse simulation, assume that the binary input to $\mathcal{N}$ is provided to $\mathcal{B}$ by
a mechanism similar to that from Theorem 6.2. One must first refer to the theorem
(Theorem 1 in Chapter 21 in PJ) that shows that by a suitable encoding a computation
of any additive machine can be done using a fixed finite amount of memory (in a finite
number of “registers”, each holding a real number) without exponential increase in
the running time. The resulting machine $T$ is then simulated by $\mathcal{N}$ in the following
way: The contents of finitely many registers of $T$ are represented as (real) weights of a
single neuroid $r$. Addition or subtraction of weights, as necessary, is done directly by
the weight update function. A comparison of weights is done with the help of $r$’s threshold
mechanism in a similar way to that in the proof of Theorem 6.1. To single out from
the comparison the weights that should not be compared one can use a similar trick as
in Theorem 6.1: for each such weight, a weight with the opposite sign is maintained. □

The power of finite additive neuroidal nets with real weights comes from their
ability to simulate oracular or nonuniform computations. For instance, in PJ it is
shown that the additive real BSS machines decide all binary sets in exponential
time. Their polynomial time coincides with the nonuniform complexity class
P/poly.

8 Conclusions 

The paper brings a relatively surprising result showing computational equiva- 
lence between certain kinds of discrete programmable and analog finite neural 
nets. This result offers new insights into the nature of computations of neural 
nets. 

First, it points to the fact that the ability of changing weights is not a con- 
dition sine qua non for learning. A similar effect can be achieved by making use 
of reasonably restricted kinds of analog neural nets. 

Second, the result showing computational equivalency of the respective nets 
supports the idea that all reasonable computational models of the brain are 
equivalent (cf. [0|). 

Third, for modeling of cognitive or learning phenomena, the neuroidal nets 
seem to be preferred over the analog ones, due to the transparency of their 
computational or learning mechanism. As far as their appropriateness for the 
task at hand is concerned, neuroidal nets with a finite set of weights seem to 
present the maximal functionality that can be achieved by living organisms. As
mathematical models, more powerful variants are also of interest.

Abstract. We prove a new combinatorial characterization of polynomial
learnability from equivalence queries, and state some of its consequences
relating the learnability of a class with the learnability via
equivalence and membership queries of its subclasses obtained by restricting
the instance space. Then we propose and study two models of
query learning in which there is a probability distribution on the instance
space, both as an application of the tools developed from the combinatorial
characterization and as models of independent interest.



1 Introduction 

The main models of learning via queries were introduced by Angluin.
In these models, the learning algorithm obtains information about the target
concept by asking queries to a teacher or expert. The algorithm has to output an
exact representation of the target concept in polynomial time. Target concepts
are formalized as languages over an alphabet. Frequently, it is assumed that the
teacher can correctly answer two kinds of questions from the learner: membership
queries and equivalence queries. Unless otherwise specified, all our discussions
are in the “proper learning” framework where the hypotheses come from the 
same class as the target concept. A combinatorial notion, called approximate 
fingerprints, turned out to characterize precisely those concept classes that can 
be learned from polynomially many equivalence queries of polynomial size ^ El • 
The essential intuition behind that fact is that the existence of queries that 
shrink the number of possibilities for the target concept by a polynomial factor 
is not only clearly sufficient, but also necessary to learn: if no such queries are 
available then adversaries can be designed that force any learner to spend too 

* Work supported in part by the EC through the Esprit Program EU BRA program 
under project 20244 (ALCOM-IT) and the EC Working Group EP27150 (NeuroColt 
II) and by the Spanish DGES PB95-0787 (Koala). 

^ Such a teacher is sometimes called “minimally adequate”.

many queries in order to identify the target. This intuition can be fully formalized
along the lines of the cited works; the formalization can be found in Q-

Hellerstein et al. gave a beautiful characterization of polynomially (EQ,MQ)-learnable
representation classes |S|. They introduced the notion of polynomial
certificates for a representation class $\mathcal{R}$ and proved that $\mathcal{R}$ is polynomially
learnable from equivalence and membership queries iff it has polynomial certificates.

The first main contribution of this paper is to propose a new combinatorial
characterization of learnability from equivalence queries, surprisingly close to
certificates, and quite different from (and also simpler to handle than) the approximate
fingerprints: the strong consistency dimension, which one can see as the analog of
the VC dimension for query models.

Angluin |210| showed that, when only approximate identification is required, 
equivalence queries can be substituted by a random sample. Thus, a PAC learn- 
ing algorithm can be obtained from an exact learning algorithm that makes 
equivalence queries. In PAC learning, introduced by Valiant CH, one has to 
learn a target concept with high probability, in polynomial time (and, a for- 
tiori, from a polynomial number of examples), within a certain error, under 
all probability distributions on the examples. Because of this last requirement, 
to learn under all distributions, PAC learning is also called distribution-free, or 
distribution-independent, learning. Distribution-independent learning is a strong 
requirement, but it can be relaxed to define PAC learning under specific distri- 
butions, or families of distributions. Indeed, several concept classes that are not 
known to be polynomially learnable, or known not to be polynomially learnable 
if $\mathrm{RP} \neq \mathrm{NP}$, turn out to be polynomially learnable under some fixed distribution
or families of distributions. 

In comparison to PAC learning, one drawback of the query models is that they 
do not have this added flexibility of relaxing the “distribution-free” condition. 
The standard transformation sets them automatically at the “distribution-free” 
level. The second main contribution of this paper is the proposal of two learning 
models in which counterexamples are not adaptively provided by a (helpful
or treacherous) teacher, but instead are nonadaptively sampled according to
a probability distribution. 

We prove that the distribution-free form of one of these models exactly co- 
incides with standard learning from equivalence queries, while the other model 
is captured by the randomized version of the standard model. This allows us to 
extend, in a natural way, the query learning model to an explicit “distribution- 
free” setting where this restrictive condition can be naturally relaxed. Some of 
the facts that we prove of these new models make use of the consistency dimen- 
sion characterization proved earlier as the first contribution of the paper. 

Our notation and terminology are standard. We assume familiarity with the
query-learning model. Most definitions will be given in the same section where
they are needed. Generally, let $X$ be a set, called the instance space or domain in
the sequel. A concept is a subset $C$ of $X$, where we sometimes prefer to regard $C$
as a function from $X$ to $\{0,1\}$. A concept class is a set $\mathcal{C} \subseteq 2^X$ of concepts. An
element of $X$ is called an instance. A pair $(x,b)$, where $b \in \{0,1\}$ is a binary
label, is called an example for concept $C$ if $C(x) = b$. A sample is a collection of
labeled instances. Concept $C$ is said to be consistent with sample $S$ if $C(x) = b$
for all $(x,b) \in S$.

A representation class is a four-tuple $\mathcal{R} = (\Sigma, \Delta, R, \mu)$. $\Sigma$ and $\Delta$ are finite
alphabets. Strings of characters in $\Sigma$ are used to describe elements of the domain
$X$, and strings of characters in $\Delta$ are used to encode concepts. $R \subseteq \Delta^*$ is the
set of strings that are concept encodings or representations. Let $\mu : R \to 2^{\Sigma^*}$
be a function that maps these representations into concepts over $\Sigma$. For ease of
technical exposition, we assume that, for each $r \in R$, there exists some $n \ge 1$ such
that $\mu(r) \subseteq \Sigma^n$. Thus each concept with a representation in $R$ has a domain of
the form $\Sigma^n$ (as opposed to domain $\Sigma^*$). The set $\mathcal{C} = \{\mu(r) : r \in R\}$ is the
concept class associated with $\mathcal{R}$.

We define the size of concept $C : \Sigma^n \to \{0,1\}$ w.r.t. representation
class $\mathcal{R}$ as the length of the shortest string $r \in R$ such that $C = \mu(r)$, or as
$\infty$ if $C$ is not representable within $\mathcal{R}$. This quantity is denoted by $|C|_{\mathcal{R}}$. With
these definitions, $\mathcal{C}$ is a “doubly parameterized class”, that is, it is partitioned
into sets $\mathcal{C}_{n,m}$ containing all concepts from $\mathcal{C}$ with domain $\Sigma^n$ and size at most
$m$. The kind of query-learning considered in this paper is proper in the sense
that concepts and hypotheses are picked from the same class $\mathcal{C}$. We will however
allow the size of a hypothesis to exceed the size of the target concept. The
number of queries needed in the worst case to obtain an affirmative answer from
the teacher, or “learning complexity”, given that the target concept belongs
to $\mathcal{C}_{n,m}$ and that the hypotheses of the learner may be picked from $\mathcal{C}_{n,M}$, is
denoted by $LC^{O}_{\mathcal{R}}(n,m,M)$, where $O$ specifies the allowed query types. In this
paper, either $O = EQ$ or $O = (EQ,MQ)$. We speak of polynomial $O$-learnability
if $LC^{O}_{\mathcal{R}}(n,m,M)$ is polynomially bounded in $n$, $m$, $M$.

We close this section with the definition of a version space. At any intermediate
stage of a query-learning process, the learner knows (from the teacher’s
answers received so far) a sample $S$ for the target concept. The current version
space $V$ is the set of all concepts from $\mathcal{C}_{n,m}$ that are consistent with $S$. These
are exactly the concepts that are still conceivable as target concepts.
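As an illustration of the last two definitions, the following minimal Python sketch checks consistency of a concept with a sample and filters a finite concept class down to its version space. The toy class (the empty set and the singletons over a three-element domain) and the helper names `consistent` and `version_space` are assumptions made for this example; they do not come from the paper.

```python
def consistent(concept, sample):
    """A concept (set of positive instances) is consistent with a sample
    (iterable of (x, b) pairs) if it labels every x in the sample with b."""
    return all((x in concept) == bool(b) for x, b in sample)


def version_space(concept_class, sample):
    """All concepts of the class that are consistent with the sample."""
    return [c for c in concept_class if consistent(c, sample)]


# Hypothetical toy class: the empty set and all singletons over {0, 1, 2}.
domain = [0, 1, 2]
toy_class = [frozenset()] + [frozenset({x}) for x in domain]

# Counterexamples received so far: instances 0 and 2 are both negative.
sample = [(0, 0), (2, 0)]
remaining = version_space(toy_class, sample)
```

Here `remaining` holds the two concepts that are still conceivable as targets after the two negative counterexamples: the empty set and the singleton containing 1.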

2 The Strong Consistency Dimension and Its 
Applications 

The proof, as it was given in 0, of the characterization of (EQ,MQ)-learning
in terms of polynomial certificates implicitly contains concrete lower and upper
bounds on the number of queries needed to learn $\mathcal{R}$. In Subsection 2.1 we make
these bounds more explicit by introducing the so-called consistency dimension
of $\mathcal{R}$ and writing the bounds in terms of this dimension (and some other parameters
associated with $\mathcal{R}$). In Subsection 2.2 we define the notions of a “strong
certificate” and of the “strong consistency dimension” and show that they serve the
same purpose for EQ-learning as the former notions did for (EQ,MQ)-learning:
we derive lower and upper bounds on the number of EQs needed to learn $\mathcal{R}$ in
terms of the strong consistency dimension and conclude that $\mathcal{R}$ is polynomially
EQ-learnable iff it has a polynomial strong certificate. In Subsection 2.3 we
prove that the strong consistency dimension of a class equals the maximum of
the consistency dimensions taken over all subclasses (induced by a restriction
of the domain). This implies that the number of EQs needed to learn a concept
class roughly equals the total number of EQs and MQs needed to learn the
hardest subclass.

^ This is a purely technical restriction that allows us to present the main ideas in the
most convincing way. It is easy to generalize the results in this paper to the case of
domains with strings of varying length.

For ease of technical exposition, we need the following definitions. A partially
defined concept $C$ on domain $\Sigma^n$ is a function from $\Sigma^n$ to $\{0,1,*\}$, where $*$
stands for “undefined”. Since partially defined concepts and samples can be
identified in the obvious manner, we use the terms “partially defined concept”
and “sample” interchangeably in the sequel. The support of $C$ is defined as
$\mathrm{supp}(C) = \{x \in \Sigma^n : C(x) \in \{0,1\}\}$. The breadth of $C$ is defined as the
cardinality of its support and denoted as $|C|$. The size of $C$ is defined as the
smallest size of a concept that is consistent with $C$. It is denoted as $|C|_{\mathcal{R}}$. Note
that this definition coincides with the previous definition of size when $C$ has full
support $\Sigma^n$. Sample $Q$ is called a subsample of sample $C$ (denoted as $Q \sqsubseteq C$)
if $\mathrm{supp}(Q) \subseteq \mathrm{supp}(C)$ and $Q$, $C$ coincide on $\mathrm{supp}(Q)$. Throughout this section,
$\mathcal{R} = (\Sigma, \Delta, R, \mu)$ denotes a representation class defining a doubly parameterized
concept class $\mathcal{C}$.

2.1 Certificates and Consistency Dimension 

$\mathcal{R}$ has polynomial certificates if there exist two-variable polynomials $p$ and $q$
such that for all $m, n > 0$ and for all $C : \Sigma^n \to \{0,1\}$ the following condition is
valid:

$$|C|_{\mathcal{R}} > p(n,m) \;\Rightarrow\; \bigl(\exists Q \sqsubseteq C : |Q| \le q(m,n) \wedge |Q|_{\mathcal{R}} > m\bigr) \qquad (1)$$

The consistency dimension of $\mathcal{R}$ is the following three-variable function:
$\mathrm{cdim}_{\mathcal{R}}(n,m,M)$, where $M \ge m > 0$ and $n > 0$, is the smallest number $d$
such that for all $C : \Sigma^n \to \{0,1\}$ the following condition is valid:

$$|C|_{\mathcal{R}} > M \;\Rightarrow\; \bigl(\exists Q \sqsubseteq C : |Q| \le d \wedge |Q|_{\mathcal{R}} > m\bigr) \qquad (2)$$

An obviously equivalent but quite useful reformulation of Condition (2) is

$$\bigl(\forall Q \sqsubseteq C : |Q| \le d \Rightarrow |Q|_{\mathcal{R}} \le m\bigr) \;\Rightarrow\; |C|_{\mathcal{R}} \le M. \qquad (3)$$

In words: if each subsample of $C : \Sigma^n \to \{0,1\}$ of breadth at most $d$ has a
consistent representation of size at most $m$, then $C$ has a consistent representation
of size at most $M$.

The following result is (more or less) implicit in |B|.

Theorem 1

$$\mathrm{cdim}_{\mathcal{R}}(n,m,M) \le LC^{EQ,MQ}_{\mathcal{R}}(n,m,M) \le \lceil \mathrm{cdim}_{\mathcal{R}}(n,m,M) \cdot \log|\mathcal{C}_{n,m}| \rceil$$

Note that the lower and the upper bound are polynomially related because

$$\log|\mathcal{C}_{n,m}| \le m \cdot \log(1 + |\Delta|). \qquad (4)$$

Clearly, Theorem 1 implies that $\mathcal{R}$ is polynomially (EQ,MQ)-learnable iff it has
polynomial certificates. We omit the proof of Theorem 1.

2.2 Strong Certificates and Strong Consistency Dimension 

We want to adapt the notions “certificate” and “consistency dimension” to the
framework of EQ-learning. Surprisingly, we can use syntactically almost the
same notions, except for a subtle but striking difference: the universe of $C$ will
be extended from the set of all concepts over domain $\Sigma^n$ to the corresponding
set of partially defined concepts. This leads to the following definitions.

$\mathcal{R}$ has polynomial strong certificates if there exist two-variable polynomials
$p$ and $q$ such that for all $m, n > 0$ and for all $C : \Sigma^n \to \{0,1,*\}$ Condition (1)
is valid.

The strong consistency dimension of $\mathcal{R}$ is the following three-variable function:
$\mathrm{scdim}_{\mathcal{R}}(n,m,M)$, where $M \ge m > 0$ and $n > 0$, is the smallest number $d$
such that for all $C : \Sigma^n \to \{0,1,*\}$ Condition (2) is valid. Again, instead of
Condition (2), we can use the equivalent Condition (3). In words: if each subsample
of $C : \Sigma^n \to \{0,1,*\}$ of breadth at most $d$ has a consistent representation of
size at most $m$, then $C$ has a consistent representation of size at most $M$.
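The two dimensions can be computed by brute force on very small classes, which may help in checking the definitions. The sketch below is only an illustration under stated assumptions: concepts are given explicitly as a dictionary from sets of positive instances to sizes, `None` plays the role of the label “*”, and the function names and the two toy classes are hypothetical.

```python
import math
from itertools import combinations, product


def size_of(sample, sized_class):
    """Smallest size of a concept consistent with the sample (inf if none).
    `sample` maps instances to labels 0/1; undefined instances are absent."""
    sizes = [size for concept, size in sized_class.items()
             if all((x in concept) == bool(b) for x, b in sample.items())]
    return min(sizes, default=math.inf)


def subsamples(sample, max_breadth):
    """All subsamples of `sample` with breadth at most `max_breadth`."""
    items = list(sample.items())
    for k in range(min(max_breadth, len(items)) + 1):
        for chosen in combinations(items, k):
            yield dict(chosen)


def dimension(domain, sized_class, m, M, strong):
    """Smallest d for which Condition (2) holds for every total concept
    (strong=False) or every partially defined concept (strong=True)."""
    labels = (0, 1, None) if strong else (0, 1)
    candidates = [
        {x: b for x, b in zip(domain, assignment) if b is not None}
        for assignment in product(labels, repeat=len(domain))
    ]
    for d in range(len(domain) + 1):
        if all(size_of(c, sized_class) <= M
               or any(size_of(q, sized_class) > m for q in subsamples(c, d))
               for c in candidates):
            return d


# Hypothetical toy classes over a three-element domain, all concepts of size 1.
domain = [0, 1, 2]
singletons = {frozenset({x}): 1 for x in domain}
with_empty = dict(singletons)
with_empty[frozenset()] = 1

cdim_singletons = dimension(domain, singletons, 1, 1, strong=False)
scdim_singletons = dimension(domain, singletons, 1, 1, strong=True)
scdim_with_empty = dimension(domain, with_empty, 1, 1, strong=True)
```

For the class of singletons alone, certifying that the empty set is not representable requires all three negative labels, so both dimensions equal 3; once the empty set is added to the class, two positive labels always suffice as a certificate and the strong consistency dimension drops to 2.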

Theorem 2 $\mathrm{scdim}_{\mathcal{R}}(n,m,M) \le LC^{EQ}_{\mathcal{R}}(n,m,M) \le \lceil \mathrm{scdim}_{\mathcal{R}}(n,m,M) \cdot \ln|\mathcal{C}_{n,m}| \rceil$

Proof. For brevity, let $q = LC^{EQ}_{\mathcal{R}}(n,m,M)$ and $d = \mathrm{scdim}_{\mathcal{R}}(n,m,M)$.

We prove the first inequality by exhibiting an adversary that forces any
learner to spend as many queries as given by the strong consistency dimension.
The minimality of $d$ implies that there is a sample $C$ such that $|C|_{\mathcal{R}} > M$
but $(\forall Q \sqsubseteq C : |Q| \le d-1 \Rightarrow |Q|_{\mathcal{R}} \le m)$. Thus, any learner issuing up to $d-1$
equivalence queries with hypotheses of size at most $M$ fails to be consistent
with $C$, and a counterexample from $C$ can be provided such that there is still
at least one consistent concept of size at most $m$ (a potential target concept).
Hence, at least $d$ queries go by until an affirmative answer is obtained.

In order to prove $q \le \lceil d \ln|\mathcal{C}_{n,m}| \rceil$, we describe an appropriate EQ-learner
$A$. $A$ keeps track of the current version space $V$ (which is $\mathcal{C}_{n,m}$ initially). For
$i = 0, 1$, let $S_V^i$ be the set

$$\{x \in \Sigma^n : \text{the fraction of concepts } C \in V \text{ with } C(x) = 1-i \text{ is less than } 1/d\}.$$

In other words, a very large fraction (at least $1 - 1/d$) of the concepts in $V$
votes for output label $i$ on instances from $S_V^i$. Let $C_V$ be the sample assigning
label $i \in \{0,1\}$ to all instances from $S_V^i$, and label “*” to all remaining instances
(those without so clear a majority). Let $Q$ be an arbitrary but fixed subsample
of $C_V$ such that $|Q| \le d$. The definition of $S_V^i$ implies (through some easy-to-check
counting) that there exists a concept $C \in V \subseteq \mathcal{C}_{n,m}$ that is consistent with
$Q$: each of the at most $d$ labeled instances of $Q$ is contradicted by a fraction of less
than $1/d$ of the concepts in $V$, so not all concepts in $V$ are excluded. Applying
Condition (3), we conclude that $|C_V|_{\mathcal{R}} \le M$, i.e., there exists an
$H \in \mathcal{C}_{n,M}$ that is consistent with $C_V$. The punchline of this discussion is: if $A$
issues the EQ with hypothesis $H$, then the next counterexample will shrink the
current version space by a factor $1 - 1/d$ (or by a smaller factor). Since the initial
version space contains $|\mathcal{C}_{n,m}|$ concepts and since $A$ is done as soon as $|V| \le 1$, a
sufficiently large number of EQs is obtained by solving

$$(1 - 1/d)^q \, |\mathcal{C}_{n,m}| < e^{-q/d} \, |\mathcal{C}_{n,m}| \le 1$$

for $q$. Clearly, $q = \lceil d \ln|\mathcal{C}_{n,m}| \rceil$ is sufficiently large. •
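The final counting step of the proof can be checked numerically. The following short Python sketch, with hypothetical function names, evaluates the number of queries $\lceil d \ln|\mathcal{C}_{n,m}| \rceil$ and the resulting bound on the size of the version space; the parameter values in the loop are arbitrary illustrations.

```python
import math


def eq_query_bound(d, class_size):
    """q = ceil(d * ln(class_size)), the number of EQs in the upper bound."""
    return math.ceil(d * math.log(class_size))


def remaining_after(q, d, class_size):
    """Bound (1 - 1/d)^q * class_size on the version-space size after q EQs."""
    return (1 - 1 / d) ** q * class_size


# Arbitrary example values: the bound drops to at most 1 in every case.
checks = [
    remaining_after(eq_query_bound(d, size), d, size) <= 1.0
    for d in (2, 5, 20)
    for size in (10, 1000, 10**6)
]
```

Every entry of `checks` is true, in line with the inequality $(1 - 1/d)^q < e^{-q/d}$ used in the proof.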

Since the lower and the upper bound in Theorem 2 are polynomially related
according to Inequality (4), we obtain

Corollary 3 $\mathcal{R}$ is polynomially EQ-learnable iff it has a polynomial strong
certificate.

2.3 EQs Alone versus EQs and MQs 

The goal of this subsection is to show that the number of EQs needed to learn 
a concept class is closely related to the total number of EQs and MQs needed 
to learn the hardest subclass. The formal statement of the main results requires 
the following definitions. 

Let $S = (S_n)_{n \ge 1}$ with $S_n \subseteq \Sigma^n$ be a family of subdomains. The restriction
of a concept $C : \Sigma^n \to \{0,1\}$ to $S_n$ is the partially defined concept (sample)
with support $S_n$ which coincides with $C$ on its support. The class containing all
restrictions of concepts from $\mathcal{C}$ to the corresponding subdomain from $S$ is called
the subclass of $\mathcal{C}$ induced by $S$ and denoted as $\mathcal{C}|S$.

The notions polynomial certificate, consistency dimension, and learning complexity
are adapted to the subclass of $\mathcal{C}$ induced by $S$ in the obvious way. $\mathcal{R}|S$ (in
words: $\mathcal{R}$ restricted to $S$) has polynomial certificates if there exist two-variable
polynomials $p$ and $q$ such that for all $m, n > 0$ and for all $C : \Sigma^n \to \{0,1,*\}$
such that $\mathrm{supp}(C) = S_n$, Condition (1) is valid. The consistency dimension of
$\mathcal{R}|S$ is the following three-variable function: $\mathrm{cdim}_{\mathcal{R}}(S_n,m,M)$ is the smallest
number $d$ such that for all $M \ge m > 0$, $n > 0$, and for all $C : \Sigma^n \to \{0,1,*\}$
such that $\mathrm{supp}(C) = S_n$, Condition (2) is valid. Again, instead of Condition (2),
we can use the equivalent Condition (3).

Quantity $LC^{EQ,MQ}_{\mathcal{R}}(S_n,m,M)$ is defined as the smallest total number of EQs
and MQs needed to learn the class of concepts from $\mathcal{C}_{n,m}$ restricted to $S_n$ with
hypotheses from $\mathcal{C}_{n,M}$ restricted to $S_n$. Quantity $LC^{EQ}_{\mathcal{R}}(S_n,m,M)$ is understood
analogously. Note that

$$LC^{EQ}_{\mathcal{R}}(S_n,m,M) \le LC^{EQ}_{\mathcal{R}}(n,m,M) \qquad (5)$$

is valid in general, because EQs become more powerful (as opposed to MQs, which
become less powerful) when we pass from the full domain to a subdomain (for the
obvious reasons). We have the analogous inequality for the strong consistency
dimension, but no such statement can be made for $LC^{EQ,MQ}_{\mathcal{R}}$ or the consistency
dimension.

The following result is a straightforward generalization of Theorem 1.

Theorem 4 $\mathrm{cdim}_{\mathcal{R}}(S_n,m,M) \le LC^{EQ,MQ}_{\mathcal{R}}(S_n,m,M) \le \lceil \mathrm{cdim}_{\mathcal{R}}(S_n,m,M) \cdot \log|(\mathcal{C}|S)_{n,m}| \rceil$.

We now turn to the main results of this section. The first one states that 
the strong consistency dimension of a class is the maximum of the consistency 
dimensions taken over all induced subclasses: 

Theorem 5 $\mathrm{scdim}_{\mathcal{R}}(n,m,M) = \max_{S \subseteq \Sigma^n} \mathrm{cdim}_{\mathcal{R}}(S,m,M)$.

Proof. Let $d^*$ be the smallest $d$ which makes Condition (2) valid for all
$C : \Sigma^n \to \{0,1,*\}$. Let $d^*(S)$ be the corresponding quantity when $C$ ranges
only over all samples with support $S$. It is evident that $d^* = \max_{S \subseteq \Sigma^n} d^*(S)$.
The theorem now follows, because by definition $d^* = \mathrm{scdim}_{\mathcal{R}}(n,m,M)$ and
$d^*(S) = \mathrm{cdim}_{\mathcal{R}}(S,m,M)$. •



Corollary 6 1. A representation class $\mathcal{R}$ has a polynomial strong certificate
iff all its induced subclasses have a polynomial certificate.

2. A representation class is polynomially EQ-learnable iff all its induced
subclasses are polynomially (EQ,MQ)-learnable.

The third result states that the number of EQs needed to learn a class equals
roughly the total number of EQs and MQs needed to learn the hardest induced
subclass.

Corollary 7 $\max_{S \subseteq \Sigma^n} LC^{EQ,MQ}_{\mathcal{R}}(S,m,M) \le LC^{EQ}_{\mathcal{R}}(n,m,M)$ and
$LC^{EQ}_{\mathcal{R}}(n,m,M) \le \lceil \ln|\mathcal{C}_{n,m}| \cdot \max_{S \subseteq \Sigma^n} LC^{EQ,MQ}_{\mathcal{R}}(S,m,M) \rceil$.

Remember that the gap $\ln|\mathcal{C}_{n,m}|$ is bounded above by $m \cdot \ln(1 + |\Delta|)$.



3 Equivalence Queries with a Probability Distribution 

Let now $\mathcal{D}$ denote a class of probability distributions on $X$, the instance space
for a computational learning framework. The two subsections of this section
introduce respective variants of equivalence query learning that take
such distributions into account.

We now briefly describe the first one. In the ordinary model of EQ-learning
$\mathcal{C}$ with hypotheses from $\mathcal{H}$, the counterexamples for incorrect hypotheses are
arbitrarily chosen, and we can think of an intelligent adversary making these
choices. EQ-learning $\mathcal{C}$ from $\mathcal{D}$-teachers (still with hypotheses from $\mathcal{H}$) proceeds
as ordinary EQ-learning, except for the following important differences:

1. Each run of the learning algorithm refers to an arbitrary but fixed pair $(C, D)$
such that $C \in \mathcal{C}$ and $D \in \mathcal{D}$, and to a given confidence parameter $0 < \delta < 1$.

2. The goal is to learn $C$ from the $D$-teacher, i.e., $C$ is considered as the target
concept (as usual), and the counterexample to an incorrect hypothesis $H$
is randomly chosen according to the conditional distribution $D(\cdot \mid C \oplus H)$,
where $\oplus$ denotes the symmetric difference of sets. Success is achieved when
this symmetric difference has zero probability. The learner must achieve a
success probability of at least $1 - \delta$.
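The behavior of such a teacher can be sketched in a few lines of Python. This is only an illustration under assumptions of our own: concepts are finite sets of instances, the distribution is a dictionary of point probabilities, and the name `d_teacher` is hypothetical.

```python
import random


def d_teacher(target, hypothesis, dist, rng):
    """Draw a counterexample from D conditioned on the symmetric difference
    of target and hypothesis. Returns None when that set has probability
    zero, i.e. when the hypothesis counts as correct."""
    diff = [x for x in target ^ hypothesis if dist.get(x, 0.0) > 0.0]
    if not diff:
        return None
    return rng.choices(diff, weights=[dist[x] for x in diff], k=1)[0]


rng = random.Random(0)
dist = {1: 0.5, 2: 0.5}          # instance 3 has probability zero
target = frozenset({1, 2})

first = d_teacher(target, frozenset({2, 3}), dist, rng)      # only 1 is possible
second = d_teacher(target, frozenset({1, 2, 3}), dist, rng)  # no counterexample
```

Note that the hypothesis containing instance 3 is accepted although it differs from the target, because the disagreement has zero probability under the distribution.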

Clearly, the more restricted the class $\mathcal{D}$ of probability distributions, the easier
the task for the learner. In this extended abstract, we focus on the following
three choices of $\mathcal{D}$.

- $\mathcal{D}_{all}$ denotes the class of all probability distributions on $X$. This is the most
general case.

- $\mathcal{D}_{unif}$ denotes the class of distributions that are uniform on a subdomain
$S \subseteq X$ and assign zero probability to instances from $X \setminus S$. This case will
be relevant in a later section.

- $\mathcal{D} = \{D\}$ is the most specific case, where $\mathcal{D}$ contains only a single probability
distribution $D$. We use it only briefly in the last section.

Loosely speaking, the main results of this section are as follows:

- The next subsection proves that, for $\mathcal{D} = \mathcal{D}_{all}$, EQ-learning from $\mathcal{D}$-teachers
is exactly as hard (same number of queries) as the standard model. (This
result is only established for deterministic learners.) Thus, we are not actually
introducing yet one more learning model, but characterizing an existing,
widely accepted one in a manner that provides the additional flexibility of
the probability distribution parameter. We thereby obtain a sensible definition
of distribution-dependent equivalence-query learning.

- In the next section, we introduce a combinatorial quantity, called the sphere
number, and show that it represents an information-theoretic barrier in the
model of EQ-learning from $\mathcal{D}_{unif}$-teachers (even for randomized learning
algorithms). However, this barrier is overcome for each fixed distribution $D$
in the model of EQ-learning from the $D$-teacher.

3.1 Random versus Arbitrary Counterexamples 

We use upper index $EQ[\mathcal{D}]$ to indicate that the $D$-teacher for some $D \in \mathcal{D}$ plays
the role of the EQ-oracle.

Theorem 8 $LC^{EQ[\mathcal{D}_{all}]}(\mathcal{C}, \mathcal{H}) = LC^{EQ}(\mathcal{C}, \mathcal{H})$.

Proof. Let $A$ be an algorithm which EQ-learns $\mathcal{C}$ from $\mathcal{D}$-teachers with
hypotheses from $\mathcal{H}$. Let $l \ge LC^{EQ}(\mathcal{C}, \mathcal{H})$ be the largest number of EQs needed by $A$
when we allow an adversary to return arbitrary counterexamples to hypotheses.
Since $LC^{EQ}(\mathcal{C}, \mathcal{H})$ is defined taking all algorithms into account, we lose no generality
in assuming that $A$ always queries hypotheses that are consistent with previous
counterexamples, so that all the counterexamples received along any run are
different. There must exist a concept $C \in \mathcal{C}$, hypotheses $H_0, \ldots, H_{l-2} \in \mathcal{H}$ and
instances $x_0, \ldots, x_{l-2} \in X$, such that the learner issues the $l - 1$ incorrect
hypotheses $H_i$ when learning target concept $C$, and the $x_i$ are the counterexamples
returned to these hypotheses by the adversary, respectively. We claim that there
exists a distribution $D$ such that, with probability at least $1 - \delta$, the $D$-teacher returns the
same counterexamples. This is technically achieved by setting $D(x_i) = (1 - \alpha)\alpha^i$
for $i = 0, \ldots, l-3$, and $D(x_{l-2}) = \alpha^{l-2}$. An easy computation shows that the
probability that the $D$-teacher presents a sequence of counterexamples different from
that of the adversary is at most $(l - 2)\alpha$. Setting $\alpha = \delta/(l - 2)$, the proof is complete. •

^ For the time being, there is no guarantee that $A$ succeeds at all, because it expects
the counterexamples to be given by a $D$-teacher. We will however see subsequently
that there exists a distribution which sort of simulates the adversary.
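The “easy computation” can be reproduced numerically. The sketch below builds geometric weights $(1-\alpha)\alpha^i$ for all but the last counterexample and, as an assumption of this example, gives the last counterexample the remaining mass $\alpha^{l-2}$ so that the weights sum to one. It then evaluates the worst case in which, at step $i$, all of $x_i, \ldots, x_{l-2}$ are still possible counterexamples. The function names are hypothetical.

```python
def adversary_mimicking_distribution(l, delta):
    """Weights D(x_0), ..., D(x_{l-2}) with alpha = delta / (l - 2).
    Assumes l >= 3; the last weight is the remaining probability mass."""
    alpha = delta / (l - 2)
    weights = [(1 - alpha) * alpha ** i for i in range(l - 2)]
    weights.append(alpha ** (l - 2))
    return alpha, weights


def prob_same_sequence(weights):
    """Probability that the D-teacher returns x_i at every step i when the
    symmetric difference may still contain x_i and all later x_j."""
    prob = 1.0
    for i in range(len(weights)):
        prob *= weights[i] / sum(weights[i:])
    return prob


alpha, weights = adversary_mimicking_distribution(l=6, delta=0.1)
p_same = prob_same_sequence(weights)
```

Each of the first $l - 2$ steps succeeds with conditional probability $1 - \alpha$, so `p_same` equals $(1 - \alpha)^{l-2} \ge 1 - (l-2)\alpha = 1 - \delta$.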

Therefore, the distribution-free case of our model coincides with standard 
EQ-learning. 

Corollary 9 Let $\mathcal{R} = (\Sigma, \Delta, R, \mu)$ be a representation class defining a doubly
parameterized concept class $\mathcal{C}$. Then $LC^{EQ[\mathcal{D}_{all}]}_{\mathcal{R}}(n,m,M) = LC^{EQ}_{\mathcal{R}}(n,m,M)$ for
all $M \ge m > 0$, $n > 0$.

This obviously implies that learners for the distribution-free equivalence
model can be transformed, through the standard EQ model, into distribution-free
PAC learners. We note in passing that, applying the standard techniques
directly on our model, we can prove the somewhat stronger fact that, for each
individual distribution $D$, a learner from $D$-teachers can be transformed into an
algorithm that PAC-learns over $D$. We can also assume knowledge of a bound
on the size of the target concept, by applying the usual trick of guessing it and
increasing the guess whenever necessary.

3.2 EQ-Learning from Random Samples 

In this subsection, we discuss another variant of the ordinary EQ-learning model.
Given a representation class $\mathcal{C}$, EQ-learning from $\mathcal{D}$-samples of size $p$ and with
hypotheses from $\mathcal{H}$ proceeds as ordinary EQ-learning, except for the following
differences:

1. Each run of the learning algorithm refers to an arbitrary but fixed pair $(C, D)$
such that $C \in \mathcal{C}$ and $D \in \mathcal{D}$, and to a given confidence parameter $0 < \delta < 1$.

2. The goal of the learner is to learn $C$ from (ordinary) EQs and a sample $P$
consisting of $p$ examples drawn independently at random according to $D$
and labeled correctly according to $C$. In other words, instead of EQ-learning
$C$ from scratch, the learner gets $P$ as additional input. The learner must
obtain an affirmative answer with probability at least $1 - \delta$.

Again the goal is to output a hypothesis for which the probability of disagreement
with the target concept is zero; this time, the information about the distribution
does not come from the counterexamples, but rather from the initial additional
sample. We will show in this section that, for certain distributions, this model is
strictly weaker than the model of EQ-learning from $\mathcal{D}$-teachers. However, in the
distribution-free sense, it corresponds to the randomized version of the model
described previously.

We first state (without proof) that each algorithm for EQ-learning from
$\mathcal{D}$-samples can be converted into a randomized algorithm for EQ-learning from
$\mathcal{D}$-teachers, such as those of the previous section, at the cost of a moderate
overhead in the number of queries.

Theorem 10 Let $q$ be the number of EQs needed to learn $\mathcal{C}$ from $\mathcal{D}$-samples of
size $p$ and with hypotheses from $\mathcal{H}$. Then a randomized learner from $\mathcal{D}$-teachers
needs at most $(p + 1)(p + q)$ EQs.

We next show an example of a class that has an identification learning algorithm in
the model of EQ-learning from $\mathcal{D}$-teachers, but has no such algorithm in
the model of EQ-learning from $\mathcal{D}$-samples.

A $\mathrm{DNF}_n$ formula is any sum $t_1 + t_2 + \cdots + t_k$ of monomials, where each
monomial $t_i$ is the product of some literals chosen from $\{x_1, \ldots, x_n, \bar{x}_1, \ldots, \bar{x}_n\}$.
Let $\mathrm{DNF} = \bigcup_n \mathrm{DNF}_n$ be the representation class of disjunctive normal form
formulas.

Let us consider the class $\mathcal{D}$ of distributions $D$ defined in the following way.
Assume that two different words $x_n$ and $y_n$ have been chosen for each $n \ge 1$.
Consider the associated distribution $D$ defined by:

$$D(x_n) = \frac{6}{\pi^2}\left(\frac{1}{n^2} - \frac{1}{2^n}\right)$$

$$D(y_n) = \frac{6}{\pi^2 \, 2^n}$$

$$D(z_n) = 0 \text{ for any word } z_n \text{ of length } n \text{ different from } x_n \text{ and } y_n.$$

$\mathcal{D}$ is obtained by letting $x_n$ and $y_n$ run over all pairs of different words of
length $n$.

Let $\mathcal{C}$ now be any class able to represent concepts consisting of pairs $\{x_n, y_n\}$
within a reasonable size; for concreteness, pick DNF formulas consisting of complete
minterms. A very easy algorithm learns them in our model of EQ-learning from
$\mathcal{D}$-teachers. The algorithm has to make at most two equivalence queries to know
the value of the target formula $f$ on $x_n$ and $y_n$. First, it asks whether $f$ is identically
zero. If a counterexample $e$ is given ($e$ must be $x_n$ or $y_n$), it makes
a second query $f = t_e$, where $t_e$ is the monomial that evaluates to one only on
$e$ (the minterm). Thus we find whether either or both of $f(x_n)$ and $f(y_n)$ are
1, and if so we also know $x_n$ and/or $y_n$ themselves. Now the target formula is
identified: the value of the formula on other points does not matter because they
have zero probability.
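The two-query strategy can be simulated directly. In the following sketch, which rests on assumptions of our own, a formula is represented by the set of words on which it evaluates to one, the hypothesis is a sum of minterms of the counterexamples seen so far, and the loop continues until the teacher has no counterexample left. The words, the length $n = 5$, and the function names are hypothetical choices for illustration.

```python
import math
import random


def d_teacher(target, hypothesis, dist, rng):
    """Counterexample drawn from D restricted to the symmetric difference,
    or None if the difference has probability zero."""
    diff = [x for x in target ^ hypothesis if dist.get(x, 0.0) > 0.0]
    if not diff:
        return None
    return rng.choices(diff, weights=[dist[x] for x in diff], k=1)[0]


def learn_from_d_teacher(target, dist, rng):
    """Start with 'f is identically zero' and add the minterm of every
    counterexample received. Returns the hypothesis and the number of
    counterexamples that were needed."""
    hypothesis = frozenset()
    counterexamples = 0
    while True:
        e = d_teacher(target, hypothesis, dist, rng)
        if e is None:
            return hypothesis, counterexamples
        hypothesis = hypothesis | {e}
        counterexamples += 1


n = 5
x_n, y_n = "01100", "10110"
dist = {
    x_n: 6 / math.pi ** 2 * (1 / n ** 2 - 1 / 2 ** n),
    y_n: 6 / (math.pi ** 2 * 2 ** n),
}
# The target is also 1 on "00000", a word of probability zero.
target = frozenset({x_n, y_n, "00000"})
learned, used = learn_from_d_teacher(target, dist, random.Random(0))
```

After at most two counterexamples the learner knows the value of the target on both words of positive probability; its disagreement with the target on "00000" is never reported, in line with the success criterion of the model.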

However, it is not difficult to see that there is a distribution $D \in \mathcal{D}$ such
that DNF formulas are not identifiable in the model of learning from EQs and
$D$-samples. Here we refer to learning DNFs of size polynomial in $n$ from polynomially
many equivalence queries of polynomial size, and with an extra initial
sample of polynomial breadth. First we note that, when sampling according to $D \in \mathcal{D}$,
there is a non-negligible probability of obtaining a sample that only contains
copies of $x_n$.

Lemma 11 For any polynomial $q$ and $0 < \delta < 1$, there exists an integer $k_0$ such
that for all $n \ge k_0$ the probability that a $D$-sample $S$ of size $q(n, 1/\delta)$ does not
contain $y_n$ is greater than $\delta$.
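A quick numerical experiment illustrates why the lemma holds: the probability of $y_n$ decays exponentially in $n$, while the sample size grows only polynomially. The sketch below assumes independent draws and uses a hypothetical polynomial sample size $2n^3$ with $\delta = 1/2$; the function name is also an assumption of this example.

```python
import math


def miss_probability(n, sample_size):
    """Probability that `sample_size` independent draws from D never hit y_n,
    where D(y_n) = 6 / (pi^2 * 2^n)."""
    p_y = 6.0 / (math.pi ** 2 * 2 ** n)
    return (1.0 - p_y) ** sample_size


delta = 0.5
# Hypothetical polynomial sample size q(n, 1/delta) = (1/delta) * n^3 = 2 n^3.
small_n = miss_probability(2, 2 * 2 ** 3)
large_n = [miss_probability(n, 2 * n ** 3) for n in range(20, 60)]
```

For small $n$ the sample is still likely to contain $y_n$, but for every $n$ in the larger range the miss probability exceeds $\delta$ and keeps approaching 1.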

Then, the following negative result follows:

Theorem 12 There exists a distribution $D$ in $\mathcal{D}$ such that DNF is not EQ-learnable
from $D$-samples.

The essential idea of the proof is that, after an initial sample revealing a
single word, the algorithm is left with a task close enough to that of learning
DNFs in the standard model with equivalence queries, which is impossible.

4 The Sphere Number and Its Applications 

The remainder of the paper uses the machinery developed in Section 2 to obtain
stronger results relating the models of the previous section, under one more 
technical condition: that the learning algorithm knows the size of the target 
concept, and never queries hypotheses longer than that. Some important learning 
algorithms do not have this property, but there are still quite a few (among the 
exact learners from equivalence queries only) that work in sort of an incremental 
fashion that leads to this property. The results become interesting because they 
lead to a precise characterization of randomized learners from D-teachers. 

We first rewrite our combinatorial material of the previous section in an 
extremely useful, geometrically intuitive form (1-spheres), and prove that for 
m = M these structures capture clearly the strong consistency dimension. Ap- 
plications follow in the next subsection. 



4.1 Strong Consistency Dimension and 1-Spheres 

A popular method for getting lower bounds on the number of queries is to show 
that the class of target concepts contains a basic “hard-to-learn” combinatorial 
structure. For instance, if the empty set is not representable but N singletons 
are, then the number of EQs, needed to identify a particular singleton, is at least 
N . In this Subsection, we consider a conceptually similarly simple structure: the 
so-called 1-spheres. They are actually a disguised (read isomorphic) version of 
sets of singletons, with the empty set simultaneously forbidden. Then we show 
that the strong consistency dimension is lower bounded by the size of the largest 
1-sphere that can be represented by C. Moreover, for M = m both quantities 
coincide. 

To make the last statements precise, we need several definitions. Let $S$ be a
finite set, and $S_0 \subseteq S$. The 1-sphere with support $S$ around center $S_0$, denoted
as $H_S^1(S_0)$ in the sequel, is the collection of sets $S_1 \subseteq S$ such that $|S_0 \oplus S_1| = 1$,
where $\oplus$ denotes the symmetric difference of sets. In other words, $S_1 \subseteq S$ belongs
to $H_S^1(S_0)$ if the Hamming distance between $S_0$ and $S_1$ is 1. Thus, it is formed
by all the points at distance (radius) 1 from the center in Hamming space.
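The definition can be illustrated with a minimal sketch that enumerates a 1-sphere for a small support. The function name `one_sphere` and the use of Python frozensets are illustrative assumptions, not part of the paper.

```python
def one_sphere(support, center):
    """Return all subsets of `support` at Hamming distance 1 from `center`.

    Hypothetical helper: sets are modelled as frozensets of instances.
    """
    support, center = frozenset(support), frozenset(center)
    if not center <= support:
        raise ValueError("center must be a subset of the support")
    # Flipping the membership of exactly one instance x yields center XOR {x}.
    return [center ^ {x} for x in sorted(support)]


# Support S = {1, 2, 3} and center S_0 = {1}: the sphere has |S| = 3 members.
sphere = one_sphere({1, 2, 3}, {1})
```

With an empty center the sphere consists exactly of the singletons over the support, which is the special case mentioned in the text.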

Let us now assume that $S \subseteq \Sigma^n$. Let $S'$ be an arbitrary subset of $S$. The
sample $C : \Sigma^n \to \{0, 1, *\}$ which represents $S'$ (as a subset of $S$) is the sample
with support $S$ that assigns label 1 to all instances from $S'$, and label 0 to all
instances from $S \setminus S'$. We say that $H_S^1(S_0)$ is representable by $\mathcal{C}_{n,[m:M]}$ if the
following two conditions are valid:

(A) Let $C_0$ be the sample with support $S$ which represents $S_0$. Then, $|C_0|_{\mathcal{R}} > M$.

(B) Each sample $C_1$ with support $S$, which represents a set $S_1 \in H_S^1(S_0)$,
satisfies $|C_1|_{\mathcal{R}} \le m$.

Thus, for the particular case of $M = m$, all points in Hamming space on the
surface of the sphere are representable within size $m$ but the center is not, just
as in the above-mentioned use of singletons, which form the 1-sphere centered on
the empty set. The size of $H_S^1(S_0)$ is defined as $|S|$. We define the three-variable
function $\mathrm{sph}_{\mathcal{R}}(n, m, M)$, called the sphere number of $\mathcal{R}$ in the sequel, as the size of
the largest 1-sphere which is representable by $\mathcal{C}_{n,[m:M]}$.

We now turn to the main result of this subsection, which implies that the 
sphere number is another lower bound on LC^'^ {n,m, M). 

Theorem 13 $\mathrm{sph}_{\mathcal{R}}(n, m, M) \le \mathrm{scdim}_{\mathcal{R}}(n, m, M)$, with equality for $M = m$.

Proof. For the sake of brevity, let $d = \mathrm{scdim}_{\mathcal{R}}(n, m, M)$ and $s = \mathrm{sph}_{\mathcal{R}}(n, m, M)$.

Let $H_S^1(S_0)$ be a largest 1-sphere that is representable by $\mathcal{C}_{n,[m:M]}$. Thus,
$|S| = s$. In order to prove $d \ge s$, we assume for the sake of contradiction that $d < s$.
Consider the sample $C_0$ with support $S$ that represents $S_0$. By Condition (A),
$|C_0|_{\mathcal{R}} > M$. According to Condition (E) applied to $C_0$, there exists a subsample
$Q \sqsubseteq C_0$ such that $|Q| \le d < s$ and $|Q|_{\mathcal{R}} > m$. Let $S_Q = \mathrm{supp}(Q) \subset S$. Let $Q_1$ be
a sample with support $S$ that totally coincides with $Q$ (and thus with $C_0$) on $S_Q$
and coincides with $C_0$ on $S \setminus S_Q$ except for one instance. Clearly, $Q_1$ represents a
set $S_1 \in H_S^1(S_0)$. By Condition (B), $|Q_1|_{\mathcal{R}} \le m$. Since $|Q|_{\mathcal{R}} \le |Q_1|_{\mathcal{R}}$, we arrive
at a contradiction.

Finally, we prove $s \ge d$ for the special case that $M = m$. It follows from the
minimality of $d$ and Condition (E) that there exists a sample $C : \Sigma^n \to \{0, 1, *\}$
such that the following holds:

1. $|C|_{\mathcal{R}} > m$.

2. $\exists Q_0 \sqsubseteq C : |Q_0| \le d \wedge |Q_0|_{\mathcal{R}} > m$.

3. $\forall Q \sqsubseteq C : (|Q| \le d - 1 \Rightarrow |Q|_{\mathcal{R}} \le m)$.

Let $S$ denote the support of $Q_0$. Note that $|S| = d$ (because otherwise the last
two conditions become contradictory). Let $S_0 \subseteq S$ be the set represented by
$Q_0$. We claim that $H_S^1(S_0)$ is representable by $\mathcal{C}_{n,[m:m]}$ (which would conclude
the proof). Condition (A) is obvious because $|Q_0|_{\mathcal{R}} > m$. Condition (B) can be
seen as follows. For each $x \in S$, define $Q_x$ as the subsample of $C$ with support
$S \setminus \{x\}$, and $Q'_x$ as the sample with support $S$ that coincides with $C$ on $S \setminus \{x\}$,
but disagrees on $x$. Because each $Q_x$ is a subsample of $C$ of breadth $d - 1$, it
follows that $|Q_x|_{\mathcal{R}} \le m$ for all $x \in S$. We conclude that the same remark applies
to the samples $Q'_x$, since a concept that is consistent with $Q_x$, but inconsistent with
$Q_0$, must be consistent with $Q'_x$. Finally note that the samples $Q'_x$, $x \in S$, are
exactly the representations of the sets in $H_S^1(S_0)$. •



4.2 Applications of the Sphere Number 

In this subsection, C denotes a concept class. The main results of this section are 
derived without referring to a representation class TZ. We will however sometimes 
apply a general theorem to the special case where the concept class consists of 
concepts with a representation of size at most m. 

It will be convenient to adapt some of our notations accordingly. For instance,
we say that the 1-sphere $H_S^1(S_0)$ is representable by $\mathcal{C}$ if $S \subseteq X$ and the following
two conditions are valid:

(A) $\mathcal{C}$ does not contain a hypothesis $H$ that assigns label 1 to all instances in
$S_0$ and label 0 to all instances in $S \setminus S_0$.

(B) For each $S' \in H_S^1(S_0)$, there exists a concept $C \in \mathcal{C}$ that assigns label 1 to
all instances in $S'$ and label 0 to all instances in $S \setminus S'$.

The following notation will be used in the sequel. If $S = \{x_1, \ldots, x_s\}$, then
$S_i = S_0 \oplus \{x_i\}$ for $i = 1, \ldots, s$. Thus, $S_1, \ldots, S_s$ are the sets belonging to
$H_S^1(S_0)$. The concept from $\mathcal{C}$ which represents $S_i$ in the sense of Condition (B)
is denoted as $C_i$.

The sphere number associated with C, denoted as sph(C), is the size of the 
largest 1-sphere that is representable by C. Similar conventions are made for the 
learning complexity measure LC. 

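Theorem 14 Let $\mathcal{C} = H_S^1(S_0)$ be a 1-sphere and $D$ an arbitrary but fixed distribution
on $S$. Then the number of EQs needed to learn $\mathcal{C}$ from the $D$-teacher with
success probability at least $1 - \delta$ is at most $1 + \lceil \log(1/\delta) \rceil$.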

Proof. Let $S = \{x_1, \ldots, x_s\}$, and let $C_1, \ldots, C_s$ be the concepts from $\mathcal{C}$ used
to represent $S_1, \ldots, S_s \in H_S^1(S_0)$, respectively. Let $H_1, \ldots, H_s$ be a permutation
of $C_1, \ldots, C_s$ sorted according to increasing values of $D(x_i)$. Consider the EQ-learner
which issues its hypotheses in this order. It follows that as long as there
exist counterexamples of a strictly positive probability, the probability that the
teacher returns the counterexample $x_j$ associated with the target concept $C_j$ is
at least 1/2 per query. Thus, the probability that the target is not known after
$\lceil \log(1/\delta) \rceil$ EQs is at most $\delta$. Hence, with probability at least $1 - \delta$, one more
query suffices to receive answer YES. •
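The halving argument in this proof can be checked numerically. The following is a minimal sketch under stated assumptions: all weights $D(x_i)$ are strictly positive, a hypothesis $C_i$ queried against target $C_j$ has exactly the two counterexamples $x_i$ and $x_j$, and the teacher picks between them with probability proportional to $D$. The function name `prob_target_unknown` is hypothetical.

```python
from fractions import Fraction


def prob_target_unknown(weights, target, k):
    """Probability that the target is still unknown after k EQs.

    weights[i] plays the role of D(x_i) (assumed strictly positive); the
    learner queries the concepts C_i in order of increasing D(x_i).
    """
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    p = Fraction(1)
    for i in order[:k]:
        if i == target:
            return Fraction(0)  # the target itself was queried: answer YES
        wi, wt = Fraction(weights[i]), Fraction(weights[target])
        # The teacher returns x_i (uninformative) with probability
        # D(x_i) / (D(x_i) + D(x_target)), which is at most 1/2 here.
        p *= wi / (wi + wt)
    return p


D = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
```

For every target and every k, the returned value stays below $2^{-k}$, which is the bound used in the proof.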

As the number of EQs needed to learn 1-spheres from arbitrary counterexamples
equals the size $s$ of the 1-sphere, and the upper bound in Theorem 14
does not depend on $s$ at all, the model of EQ-learning from the $D$-teacher for a
fixed distribution $D$ is, in general, more powerful than the ordinary model. The
gap between the number of EQs needed in both models can be made arbitrarily
large.

Recall that $\mathcal{D}_{unif}$ denotes the class of distributions that are uniform on a
subdomain $S \subseteq X$ and assign zero probability to instances from $X \setminus S$.

Theorem 15 The following lower bound holds even for randomized learners:

LCf^{C) > /1(C) > (1 - 5)svh{C). 



Proof. The first inequality is trivial. We prove the second one. 

Let $S = \{x_1, \ldots, x_s\}$, and let $C_1, \ldots, C_s$ be the concepts from $\mathcal{C}$ used to
represent $S_1, \ldots, S_s \in H_S^1(S_0)$, respectively. For $j = 1, \ldots, s$, let $D_j$ be the
probability distribution that assigns zero probability to $x_j$ and is uniform on the
remaining instances from $S$. Clearly, $D_j \in \mathcal{D}_{unif}$.

A learner must receive answer YES with success probability at least $1 - \delta$
for each pair $(C, D)$, where $C \in \mathcal{C}$ is the target concept, and counterexamples are
returned randomly according to $D \in \mathcal{D}$. It follows that, if the target concept $C_j$ is
drawn uniformly at random from $\{C_1, \ldots, C_s\}$, and counterexamples are subsequently
returned according to $D_j$, answer YES is still obtained with success probability
at least $1 - \delta$. Note that we randomize over the uniform distribution
on the 1-sphere (random selection of the target concept), over the drawings of
distribution $D_j$ conditioned to the current sets of counterexamples, respectively,
and over the internal coin tosses of the learner.

Assume w.l.o.g. that all hypotheses are consistent with the counterexamples
received so far. Let $C'$ be the next hypothesis, and $S' \subseteq S$ the subset of instances
from $S$ being labeled 1 by $C'$. Because $H_S^1(S_0)$ is representable by $\mathcal{C}$, $S'$ must
differ from $S_0$ on at least one element of $S$. If $S' = S_j$, then the learner receives
answer YES. Otherwise, the set $U = (S' \oplus S_j) \setminus \{x_j\}$ is not empty. Note that
the counterexample $x_i$ to $C'$ is picked from $U$ uniformly at random. This leads
to the removal of only $C_i$ from the current version space $V$.

The punchline of this discussion is that the following holds after the return
of $q$ counterexamples:

1. The current version space $V$ contains $s - q$ candidate concepts from $C_1, \ldots, C_s$.
They are (by symmetry) statistically indistinguishable to the learner.

2. The next hypothesis is essentially a random guess in $V$, that is, the chance to
receive answer YES is exactly $1/|V|$. The reason is that, from the perspective
of the learner, all candidate target concepts in $V$ are equally likely. (This might
look unintuitive at first glance, because the learner does not necessarily draw
the next hypothesis at random from $V$ according to the uniform distribution.
But notice that a random bit cannot be guessed with a probability of success larger
than 1/2 no matter which procedure for “guessing” is applied. This is the kind of
argument that we used.)

If answer YES is received before $s$ EQs were issued, then only because it was
guessed within $V$ by chance. We can illustrate this by thinking of two players.
Player 1 determines at random a number between 1 and $s$ (the hidden target
concept). Player 2 starts random guesses. The probability that the target number
was determined after $q$ guesses is exactly $q/s$. Thus, at least $(1 - \delta)s$ guesses are
required to achieve probability $1 - \delta$ of success. •
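The two-player game at the end of the proof can be written down directly. This is a minimal sketch of the arithmetic only, assuming distinct guesses and a target drawn uniformly from 1 to $s$; the function names are hypothetical.

```python
from fractions import Fraction
from math import ceil


def success_probability(s, q):
    """Probability that q distinct guesses hit a target uniform on 1..s."""
    guesses = list(range(1, s + 1))[:q]  # any fixed order of distinct guesses
    hits = sum(1 for target in range(1, s + 1) if target in guesses)
    return Fraction(hits, s)


def guesses_needed(s, delta):
    """Smallest number of guesses whose success probability reaches 1 - delta."""
    return ceil((1 - Fraction(delta)) * s)
```

For example, with $s = 100$ and $\delta = 1/4$ the second function returns 75, matching the $(1 - \delta)s$ bound in the text.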



Corollary 16 Let $\mathcal{R} = (\Sigma, \Delta, R, \mu)$ be a representation class defining a doubly
parameterized concept class $\mathcal{C}$. The following lower bound holds for all $m$ and $n$,
even for randomized learners:

LC^{n, TO, to) > (n, TO, to) > (1 — 6)sphj^{n, to, to) 

This means that, for learning algorithms that do not make queries longer
than the size of the target concept, the information-theoretic barrier for EQ-learning
from arbitrary counterexamples is still a barrier for EQ-learning from
$\mathcal{D}_{unif}$-teachers. This negative result holds even when the learner is randomized,
and therefore it applies as well to the model of EQ-learning from $\mathcal{D}$-samples,
which has been proved earlier to be subsumed by randomized learners
from D-teachers.

On the other hand, note that the results of this section generalize readily to
the case in which the hypotheses queried come from a different class $\mathcal{H}$ larger
than $\mathcal{C}$, or in particular to learners which query hypotheses of size up to $M > m$
when the target is known to have size at most $m$. The point is that, for this
case, the last corollary no longer has the interpretation that we have
described, since the sphere number is no longer guaranteed to coincide with the
strong consistency dimension, so the lower bound for the learning complexity
that we would obtain is no longer, in principle, the same information-theoretic
barrier as for EQ-learning.
Abstract. This paper derives the Vapnik-Chervonenkis dimension of
several natural subclasses of pattern languages. For classes with unbounded
VC-dimension, an attempt is made to quantify the “rate of
growth” of VC-dimension for these classes. This is achieved by computing,
for each n, the size of the “smallest” witness set of n elements that is
shattered by the class. The paper considers both erasing (empty substitutions
allowed) and nonerasing (empty substitutions not allowed) pattern
languages. For erasing pattern languages, optimal bounds for this size 
— within polynomial order — are derived for the case of 1 variable oc- 
currence and unary alphabet, for the case where the number of variable 
occurrences is bounded by a constant, and the general case of all pattern 
languages. The extent to which these results hold for nonerasing pattern 
languages is also investigated. Some results that shed light on efficient 
learning of subclasses of pattern languages are also given. 



1 Introduction 

The simple and intuitive notion of pattern languages was formally introduced
by Angluin and has been studied extensively, both in the context of formal
language theory and computational learning theory. We give a brief overview of
the work on learnability of pattern languages to provide a context for the results
in this paper. We refer the reader to Salomaa for a review of the work
on pattern languages in formal language theory.

In the present paper, we consider both kinds of pattern languages: erasing
(when empty substitutions are allowed) and nonerasing (when empty substitutions
are not allowed). Angluin [2] showed that the class of nonerasing pattern
languages is identifiable in the limit from only positive data in Gold’s model.
Since their introduction, pattern languages and their variants have been a subject
of intense study in the identification in the limit framework (for a review, see
Shinohara and Arikawa). Learnability of the class of erasing pattern languages was
first considered by Shinohara in the identification in the limit framework.
This class turns out to be very complex and it is still open whether, for a finite
alphabet of size > 1, the class of erasing pattern languages can be identified in
the limit from only positive data.^1

Since the class of nonerasing pattern languages is identifiable in the limit from
only positive data, a natural question is whether there is any gain to be had if negative
data is also present. Lange and Zeugmann observed that in the presence
of both positive and negative data, the class of nonerasing pattern languages is
identifiable with 0 mind changes; that is, there is a learner that after looking at
a sufficient number of positive and negative examples comes up with the correct
pattern for the language. This restricted “one-shot” version of identification in
the limit is referred to as finite identification.

Since finite identification is a batch model, finite learning from both positive
and negative data may be viewed as an idealized version of Valiant’s PAC
model. In this paper, we show that even the VC-dimension of patterns with one
single variable and an alphabet size of 1 is unbounded. This implies that even
this restricted class of pattern languages is not learnable in Valiant’s sense, even
if we omit all polynomial time constraints from Valiant’s definition of learning.
This result (which holds for both the nonerasing and erasing cases) may appear
to be at odds with the observation of Lange and Zeugmann that the class
of nonerasing pattern languages can be finitely learned from both positive and
negative data. The apparent discrepancy between the two results is due to a
subtle difference in the manner in which the two models treat data. In finite
identification, the learner has the luxury of waiting for a finite, but unbounded,
sample of positive and negative examples before making its conjecture. On the
other hand, the learner in the PAC model is required to perform on any fixed
sample of an “adequate” size. So, clearly the conditions in the PAC setting with
respect to data presentation are more strict.

Since this restricted class and several other subclasses of pattern languages
considered in this paper have unbounded VC-dimension, we make an attempt to
quantify the rate of growth of VC-dimension for these classes. This is achieved
by computing, for each $n$, the size of the “smallest” witness set of $n$ elements
that is shattered by the class. The motivation for computing such a Vapnik-Chervonenkis
Witness Size is as follows. Although classes with unbounded VC-dimension
are not PAC-learnable in general, they may become learnable under
certain constraints on the distribution. An often used constraint is that the
distribution favors short strings. Therefore, an interesting question is: How large is
the VC-dimension if only strings of a certain length are considered? Mathematically,
it is perhaps more elegant to pose the question: What is the least length
$m$ such that $n$ strings of size up to $m$ are shattered? We refer to this value of
$m$ as the Vapnik-Chervonenkis Dimension Witness Size for $n$ and express it as
a function of $n$, $\mathrm{vcdws}(n)$. Hence, the higher the growth rate of the function vcdws, the
smaller is the “local” VC-dimension for strings up to a fixed length and the “easier”
it may be to learn the class under a suitably constrained variant of the PAC
model.

^1 See Mitchell, where a subclass of erasing pattern languages is shown to be
identifiable in the limit from only positive data. That paper also shows that the class of
erasing pattern languages is learnable if the alphabet size is 1 or $\infty$.

Although the VC-dimension of 1-variable pattern languages is unbounded, 
we note that if at least one positive example is present, the VC-dimension of 
nonerasing pattern languages becomes bounded (this can be formally expressed 
in the terminology of version-spaces). Unfortunately, this does not help in the 
case of erasing pattern languages, as we are able to show that the VC-dimension 
of 1-variable erasing pattern languages is unbounded even in the presence of a 
positive example. 

In $k$-variable patterns, the bound $k$ is on the number of distinct variables in
the pattern and not on the total number of occurrences of all variables. We also
consider the case where the number of occurrences of all the variables in a pattern
is bounded. We show that the VC-dimension of the class of languages generated
by patterns with at most 1 variable occurrence is 2. For variable occurrence count
$\ge 2$, the VC-dimension turns out to be unbounded provided the alphabet size is
at least 2 and the pattern has at least two distinct variables. We also consider
the case where the only requirement is that each variable occurs exactly $n$ times
in the pattern (so, there is neither any bound on the number of distinct variables
nor any bound on the total number of variable occurrences). We show that the
VC-dimension of languages generated by patterns in which each variable occurs
exactly once is unbounded. We note that this result also holds for any general $n$.

Having established several VC-dimension results, we turn our attention to
issues involved in efficient learning of pattern language subclasses. One problem
with efficient learning of pattern languages is the NP-completeness of the
membership decision.^2 This NP-completeness result already implies that pattern
languages cannot be learned polynomially in Valiant’s sense when the hypotheses
are patterns (because Valiant requires that, for a given instance, the
output of the hypothesis must be computable in polynomial time). Schapire [22]
strengthened this result by showing that pattern languages cannot be polynomially
PAC-learned independent of the representation chosen for the hypotheses.
Computing the output of a hypothesis cannot be done in polynomial time using
any coding scheme which is powerful enough for learning pattern languages. Also,
Ko, Marron and Tzeng have shown that the problem of finding any pattern
consistent with a set of positive and negative examples is NP-hard. Marron and
Ko considered necessary and sufficient conditions on a finite positive initial
sample that allow exact identification of a target $k$-variable pattern from the
initial sample and from polynomially many membership queries. Later, Marron
considered the exact learnability of $k$-variable patterns with polynomially
many membership queries, but where the initial sample consists of only a single
positive example.

^2 Angluin showed that the class of nonerasing pattern languages is not learnable
with polynomially many queries if only equivalence, membership, and subset queries
are allowed and as long as any hypothesis space with the same expressive power as
the class being learned is considered. However, she gave an algorithm for exactly
learning the class with a polynomial number of superset queries.

In the PAC setting, Kearns and Pitt showed that $k$-variable pattern
languages can be PAC-learned under product distributions^3 from polynomially
many strings. At first blush, their result appears to contradict our claim that
$k$-variable patterns have an unbounded VC-dimension. A closer look at their result
reveals that they assume an upper bound on the length of substitution strings,
which essentially bounds the VC-dimension of the class. When the substitutions
of all variables are governed by independent and identical distributions,
then $k$-variable pattern languages can (under a mild additional distributional
assumption) even be learned linearly in the length of the target pattern and
singly exponentially in $k$.

In this paper we show that in the case of nonerasing pattern languages, the
first positive example string contains enough information to bound the necessary
sample size without any assumptions on the underlying distribution. (This result
holds even for infinite alphabets.) Unfortunately, as already noted, this result
does not translate to the case of $k$-variable erasing pattern languages, as even in
the presence of a positive example, the VC-dimension of single-variable erasing
pattern languages is unbounded.

We finally consider some results in the framework of agnostic learning.
Here no prior knowledge about the actual concept class being learned is
assumed. The learner is required to approximate the observed distribution on
classified instances almost as well as possible in the given hypothesis language
(with high probability) in polynomial time. Agnostic learning may be viewed
as the branch of learning theory closest to practical applications. Unfortunately,
not even conjunctive concepts and half-spaces are agnostically learnable.
Shallow decision trees, however, have been shown to be agnostically learnable.



2 Preliminaries 

The symbol $\varepsilon$ denotes the empty string. Let $s$ be a string, word, or pattern.
Then the length of $s$, denoted $|s|$, is the number of symbols in $s$. A pattern
$\sigma$ is a string over elements from the basic alphabet $\Sigma$ and variables from a
variable alphabet; we use lower case Latin letters for elements of $\Sigma$ and upper
case Latin letters for variables. The number of variables is the number of distinct
variable symbols occurring in a pattern; the number of occurrences of variables
is the total number of occurrences of variable symbols in a pattern. An erasing
pattern language contains all words $x$ generated by the pattern in the sense
that every occurrence of a variable $A$ is substituted within the whole word by the
same string $\alpha \in \Sigma^*$; a nonerasing pattern language contains the words where
the variables are substituted by non-empty strings only. Thus, in nonerasing
pattern languages every word is at least as long as the pattern generating it.

^3 More precisely, they require the positive examples in the sample to be generated
according to a product distribution, but allow any arbitrary distribution for the
negative examples.

For example, if $\sigma = aAbbBabAba$, then the length is 10, the number of
variables is 2 and the number of variable occurrences is 3. In an erasing pattern
language, $\sigma$ generates the words abbabba (by $A = \varepsilon$ and $B = \varepsilon$) and abbaabba
(by $A = \varepsilon$ and $B = a$) which it does not generate in the nonerasing case. In both
cases, erasing and nonerasing, $\sigma$ generates the word aabbabaababa (by $A = a$ and
$B = aba$). This allows us to define subclasses of pattern languages generated by
a pattern with up to $k$ variables or up to $l$ variable occurrences.
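The example can be reproduced with a small backtracking membership test. This is only a sketch, assuming the convention from the text that lower case letters are constants and upper case letters are variables; the function name `generates` is hypothetical, and the search is exponential in the worst case, in line with the hardness of the membership problem.

```python
def generates(pattern, word, erasing=True, binding=None):
    """Return True if `pattern` generates `word`.

    Lower case symbols are constants, upper case symbols are variables.
    With erasing=False, variables must be replaced by non-empty strings.
    """
    binding = {} if binding is None else binding
    if not pattern:
        return word == ""
    head, rest = pattern[0], pattern[1:]
    if head.islower():  # constant symbol: must match the next character
        return word.startswith(head) and generates(rest, word[1:], erasing, binding)
    if head in binding:  # variable already substituted: reuse the same string
        sub = binding[head]
        return word.startswith(sub) and generates(rest, word[len(sub):], erasing, binding)
    # Unbound variable: try every prefix of the remaining word as substitution.
    for length in range(0 if erasing else 1, len(word) + 1):
        if generates(rest, word[length:], erasing, {**binding, head: word[:length]}):
            return True
    return False


sigma = "aAbbBabAba"
```

Calling `generates(sigma, "abbabba")` succeeds only in the erasing case, while `generates(sigma, "aabbabaababa", erasing=False)` succeeds in both cases, as stated above.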

Quantifying Unbounded VC-Dimension. The VC-dimension of a class $\mathcal{L}$
of languages is the size of the largest set of words $S$ such that $\mathcal{L}$ shatters $S$.
The VC-dimension of a class $\mathcal{L}$ is unbounded iff, for every $n$, there are $n$ words
$x_1, x_2, \ldots, x_n$ such that $\mathcal{L}$ shatters them. As motivated in the introduction, we
introduce the function

$\mathrm{vcdws}(n) = \min\{\max\{|x_1|, |x_2|, \ldots, |x_n|\} : \mathcal{L} \text{ shatters } \{x_1, x_2, \ldots, x_n\}\}$

where vcdws stands for Vapnik-Chervonenkis Dimension Witness Size and returns
the size of the smallest witness for the fact that the VC-dimension is at
least $n$. Determining vcdws for several natural classes is one of the main results
of the present work.

Version Spaces. Given a set of languages $\mathcal{L}$ and a set of positive and negative
example strings $S$, the version space $VS(\mathcal{L}, S)$ consists of all languages in $\mathcal{L}$
that generate all the positive but none of the negative strings in $S$. It follows
from Theorem 2.1 of Blumer et al. that $\mathcal{L}$ can be PAC-learned with the
sample $S$ and a finite number of additional examples if $VS(\mathcal{L}, S)$ has a finite
VC-dimension. In Section 3 we will show that, for certain classes, the VC-dimension
of the version space remains infinite after a sample $S$ has been read while in
Section 5 we show that, for other classes, the VC-dimension of the version space
can turn from infinite to finite when the first positive example arrives.

3 VC-Dimension of Erasing Pattern Languages 

Our first result shows that even the very restrictive class of 1-variable erasing 
pattern languages over the unary alphabet has an unbounded VC-dimension. 
This special case is the only one for which the exact value of vcdws is known. 

Theorem 1. The VC-dimension of the class of erasing 1-variable pattern languages
is unbounded. If the size of the alphabet is 1, then one can determine the
exact size of the smallest witness by the formula $\mathrm{vcdws}(n) = p_2 \cdot p_3 \cdot \ldots \cdot p_{n-1}$
where $p_m$ is the $m$-th prime number ($p_1 = 2, p_2 = 3, p_3 = 5, p_4 = 7, \ldots$).

Proof: Let $\Sigma = \{a\}$. For the direction $\mathrm{vcdws}(n) \le p_2 \cdot p_3 \cdot \ldots \cdot p_{n-1}$, let $x_k = a^{m_k}$
where $m_k = p_1 \cdot p_2 \cdot \ldots \cdot p_{n-1} / p_k$, that is, $m_k$ is the product of the first $n - 1$
primes except the $k$-th one. Let $x_n = \varepsilon$. For every subset $E \subseteq \{x_1, x_2, \ldots, x_n\}$,
let $p_E$ be the product of those $p_k$ where $k < n$ and $x_k \notin E$, and where $p_E = 1$ if
no such $k$ exists. Now the patterns $A^{p_E}$ and $a^{p_E}A^{p_E}$ generate a word $a^m$ with $m > 0$ iff
$p_E$ divides $m$. Furthermore, the word $\varepsilon$ is generated by $A^{p_E}$ but not by $a^{p_E}A^{p_E}$.
So the language generated by $A^{p_E}$ contains exactly the $x_k \in E$ in the case
$x_n \in E$; the language generated by $a^{p_E}A^{p_E}$ contains exactly the $x_k \in E$ in the
case $x_n \notin E$. Since the longest $x_k$ is $x_1$, whose length is $p_2 \cdot p_3 \cdot \ldots \cdot p_{n-1}$, one has
that $\mathrm{vcdws}(n) \le p_2 \cdot p_3 \cdot \ldots \cdot p_{n-1}$.

For the converse direction assume that the erasing 1-variable pattern languages
shatter $E = \{a^{m_1}, a^{m_2}, \ldots, a^{m_n}\}$ where $m_1 < m_2 < \ldots < m_n$.
Let $a^{d_k}A^{e_k}$ generate all elements in $E$ except $a^{m_k}$. Now one has, for
$k' \in \{2, 3, \ldots, n\} - \{k\}$, that $a^{m_{k'}} = a^{d_k + c_{k'} \cdot e_k}$, $a^{m_1} = a^{d_k + c_1 \cdot e_k}$ and
$m_{k'} - m_1 = (c_{k'} - c_1) e_k$. On the other hand, $m_k - m_1$ is not a multiple of $e_k$ and $m_k - m_1$
has a prime factor $q_k$ which does not divide any difference $m_{k'} - m_1$. It follows
that for every difference $m_k - m_1$ there is one prime number $q_k$ dividing all other
differences $m_{k'} - m_1$ and therefore, any product of $n - 2$ of such different prime
numbers must divide some difference $m_k - m_1$. The product $q_2 \cdot q_3 \cdot \ldots \cdot q_n / q_k$ is
a lower bound for $m_k$ and, for the $k$ with the smallest number $q_k$, $m_k$ is at least
the product of the primes $p_2 \cdot p_3 \cdot \ldots \cdot p_{n-1}$. So $\mathrm{vcdws}(n) \ge p_2 \cdot p_3 \cdot \ldots \cdot p_{n-1}$. ∎
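For very small $n$ the formula of Theorem 1 can be checked by exhaustive search. The sketch below rests on one modelling assumption: over a unary alphabet a word is determined by its length, and a 1-variable pattern with $c$ constants and $e \ge 1$ occurrences of the variable generates exactly the lengths $c + e \cdot t$ for $t \ge 0$. The function names are hypothetical.

```python
from itertools import combinations


def language(c, e, max_len):
    """Word lengths up to max_len generated by the unary pattern a^c A^e."""
    return frozenset(range(c, max_len + 1, e))


def is_shattered(lengths, max_len):
    """Check whether the 1-variable erasing patterns shatter the given lengths."""
    traces = set()
    for c in range(max_len + 2):          # c = max_len + 1 yields the empty trace
        for e in range(1, max_len + 2):   # larger e behaves like e = max_len + 1
            traces.add(language(c, e, max_len) & lengths)
    return len(traces) == 2 ** len(lengths)


def vcdws_unary(n, search_limit=20):
    """Smallest maximal length of n unary words that are shattered, by brute force."""
    for m in range(search_limit + 1):
        for rest in combinations(range(m), n - 1):
            if is_shattered(frozenset(rest) | {m}, m):
                return m
    return None
```

For $n = 3$ the search returns 3, which agrees with $p_2 = 3$; the shattered set of lengths is $\{0, 2, 3\}$, the same witness as in the proof.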

As noted, this is the only case where vcdws has been determined exactly. It
will be shown that more variables enable smaller values for $\mathrm{vcdws}(n)$ while it is
unknown whether larger alphabets give smaller values for $\mathrm{vcdws}(n)$ in the case of
erasing 1-variable pattern languages. The above proof even shows the following:
Given any positive example $w$, there is still no bound on the VC-dimension of
the version space of the class with respect to the example set $\{w\}$. Since the
given $w$ takes the place of $x_n$ in the proof above, one now gets the upper bound
$|w| + p_2 \cdot p_3 \cdot \ldots \cdot p_{n-1}$ instead of $p_2 \cdot p_3 \cdot \ldots \cdot p_{n-1}$ and uses for $x_k$ the words $w a^{m_k}$
with $m_k = p_1 \cdot p_2 \cdot \ldots \cdot p_{n-1} / p_k$ in the case $k < n$ and $x_k = w$ in the case $k = n$.

Theorem 2. For any positive example $w$, the VC-dimension of the version
space of the class of all erasing 1-variable pattern languages is unbounded and
$\mathrm{vcdws}(n) \le |w| + p_2 \cdot p_3 \cdot \ldots \cdot p_{n-1}$.

However, if two positive examples are present, then in some rare cases it may
be possible to bound the number of patterns. For example, let the alphabet be
$\Sigma = \{a, b\}$. Then if both strings $a$ and $b$ are in the language and are presented as
positive examples, it is immediate that the only pattern language that satisfies
this case is $\Sigma^*$.

An alternative to limiting the number of variables is limiting the number of
variable occurrences. If the bound is 1, then the class is quite restrictive and has
VC-dimension 2. However, as soon as the bound becomes 2, one has unbounded
VC-dimension.

Theorem 3. The VC-dimension of the class of erasing pattern languages gen- 
erated by patterns with at most 1 variable occurrence is 2. 

One can generalize the result and show that every 1-variable erasing pattern
language with up to $k$ occurrences of this variable has bounded Vapnik-Chervonenkis
dimension. This is no longer true for 2-variable erasing pattern languages
with up to 2 occurrences.

Theorem 4. The VC-dimension of the class of all erasing pattern languages
generated by patterns with at most 2 variable occurrences is unbounded. Furthermore,
$\mathrm{vcdws}(n) \le (3n + 2) \cdot 2^n$.

Proof: For each $k$, let $x_k$ be the concatenation of all strings $a^n b \sigma b a^n$ where
$\sigma \in \{a, b\}^n$ and the $k$-th character of $\sigma$ is a $b$. Now, for every subset $E$ of
$\{x_1, x_2, \ldots, x_n\}$ let $\sigma_E$ be a string of length $n$ such that $\sigma_E(k) = a$ if $x_k \notin E$
and $\sigma_E(k) = b$ if $x_k \in E$. Now the language generated by $A b \sigma_E b B$ contains $x_k$
iff $b \sigma_E b$ is a substring of $x_k$ iff the $k$-th character in $\sigma_E$ is a $b$, which by definition
is equivalent to $x_k \in E$. So, the set $\{x_1, x_2, \ldots, x_n\}$ is shattered. ∎
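The construction in this proof is easy to test for small $n$. The sketch below builds the witness words and checks shattering, using the fact from the proof that the erasing pattern $A\,b\,\sigma_E\,b\,B$ generates a word exactly when $b\,\sigma_E\,b$ is a substring of it. The function names are hypothetical.

```python
from itertools import product


def witness_words(n):
    """Build x_1..x_n: x_k concatenates all blocks a^n b s b a^n with s[k] == 'b'."""
    words = []
    for k in range(n):
        blocks = [
            "a" * n + "b" + "".join(s) + "b" + "a" * n
            for s in product("ab", repeat=n)
            if s[k] == "b"
        ]
        words.append("".join(blocks))
    return words


def shattered_by_two_occurrence_patterns(n):
    """Check that the patterns A b sigma_E b B pick out every subset E."""
    words = witness_words(n)
    for chars in product("ab", repeat=n):   # sigma_E encodes E: 'b' means x_k in E
        sigma_e = "".join(chars)
        core = "b" + sigma_e + "b"
        for k, x in enumerate(words):
            if (core in x) != (sigma_e[k] == "b"):
                return False
    return True
```

Each witness word built this way consists of $2^{n-1}$ blocks of length $3n + 2$, which stays within the bound stated in the theorem.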

The corresponding theorem holds also for nonerasing pattern languages since $A$
and $B$ take at least the string $a^n$ or even something longer. A natural question
is whether the lower bound is also exponential in $n$. The next theorem answers
this question affirmatively.

Theorem 5. For given k, the class of all pattern languages with up to k variable 
occurrences satisfies vcdws{n) > . 

Proof: Given $x_1, x_2, \ldots, x_n$, one needs $2^{n-1}$ patterns which contain $x_1$ and
shatter $x_2, x_3, \ldots, x_n$. Let $m = |x_1|$, which is a lower bound for the size if
$x_1, x_2, \ldots, x_n$ are the shortest words shattered by the considered class. A pattern
generating $x_1$ has $h \le k$ variable occurrences and, for the $i$-th variable
occurrence in this word, one has a beginning entry $a_i \le |x_1| \le m$, and the
length $b_i$ of the variable in $x_1$. Knowing $x_1$, each pattern generating $x_1$ has
a unique description with the given parameters. So one gets the upper bound
1 -|- -|- . . . -I- for the number of patterns generating x\ and 

has > 2”“^, that is, m > | 

The next theorem is about the general case of the class of all erasing pattern
languages. Strict lower bounds are $\mathrm{vcdws}(n) \ge \log(n) / \log(|\Sigma|)$ for the case of
alphabet size 2 or more and $\mathrm{vcdws}(n) \ge n - 1$ for the case of alphabet size 1.
These lower bounds are given by the size of the largest string within a set of $n$
strings. These straightforward bounds are optimal up to a linear factor for the
unary and binary alphabets.

Theorem 6. For arbitrary erasing pattern languages:

(a) If $\Sigma = \{a\}$ then $\mathrm{vcdws}(n) \le 2n$.

(b) If $\Sigma = \{a, b\}$ then $\mathrm{vcdws}(n) \le 4 + 2 \cdot \log(n)$.

Proof (a) In this case $\Sigma = \{a\}$. Let $x_1 = a^{n+1}, x_2 = a^{n+2}, \ldots, x_n = a^{n+n}$ and
$E$ be a subset of $x_1, x_2, \ldots, x_n$. Now let $\sigma_E$ be the concatenation of those $A_k^{n+k}$
with $x_k \in E$: Taking $A_k = a$ and all other variables to $\varepsilon$, the word generated
by $\sigma_E$ is $x_k$. If at least two variables are not empty or one takes a string strictly




100 



Andrew Mitchell et al. 



longer than 1 then the overall length is at least $2n + 2$ and the word generated is
outside $\{x_1, x_2, \ldots, x_n\}$. So, $\sigma_E$ generates exactly those words $x_k$ which are in
$E$ and the erasing pattern languages shatter $\{x_1, x_2, \ldots, x_n\}$.

Proof (b) In this case $\Sigma = \{a, b\}$. Let $m$ be the first integer such that the set
$U_m$ contains at least $n$ strings, where $U_m$ consists of all words $w \in \{a, b\}^m$ such
that $a$ and $b$ occur equally often in $w$, the first character of $w$ is $a$, and $aa$, $bb$, $ab$ and
$ba$ are subwords of $w$. One can show that $m \le 4 + 2 \cdot \log(n)$. Let $x_1, x_2, \ldots, x_n$
be different words in $U_m$. Now, for every $k$, let $\sigma_k$ be the pattern obtained from
$x_k$ by replacing $a$ by $A_k$ and $b$ by $B_k$, and let $\sigma_E$ be the concatenation of those
$\sigma_k$ where $x_k \in E$. Since every variable occurs in $\sigma_E$ exactly $m/2$ times, one has that
either one variable is assigned to some fixed $u \in \{a, b\}^2$ and the generated word
is $u^{m/2}$, or there are two variables such that one of them is assigned to $a$ and
the other one to b. Since Xk yf — the word does not contain all 

subwords aa,ab,ba,bb — and since Xk ^ and since the 

first character of $x_k$ is $a$, one can conclude that $A_l = a$ and $B_l = b$ for some
$l$. Now the word generated by $\sigma_E$ is $x_l$, and $x_l = x_k$ only for $l = k$. Thus $\sigma_E$
generates exactly those $x_k$ with $x_k \in E$.

The trivial lower bound is constant 1 for infinite alphabet $\Sigma$. But it is impossible
to have a constant upper bound for the size of the smallest witness: indeed there
is, for every $k$, an $n$ with $vcdws(n) > k$.



4 VC-Dimension of Nonerasing Pattern Languages 

Many of the theorems for erasing pattern languages can be adapted to the case 
of nonerasing pattern languages. In many theorems the upper bound increases 
by a sublinear factor (measured in the size of the previous value of the function 
vcdws). Furthermore one gets the following lower bound: 

Theorem 7. For any class of nonerasing pattern languages, $vcdws(n) \ge \frac{n-1}{2 \cdot \log(n+2)}$.



Proof: Given $x_1, x_2, \ldots, x_n$, one needs $2^{n-1}$ patterns which contain $x_1$ and
shatter $x_2, x_3, \ldots, x_n$. Let $m = |x_1|$, which is a lower bound for the size if
$x_1, x_2, \ldots, x_n$ are the shortest words shattered by the considered class. A pattern
generating $x_1$ can be described as follows: To each position one assigns either
the value 0 if this position is covered by a constant from the pattern generating
it, the value $h$ if it is the first character of some occurrence of the $h$-th variable
($h \le m$), and $m + 1$ if it is some subsequent character of the occurrence of some
variable. Together with $x_1$ itself, this string either describes uniquely the pattern
generating $x_1$ or is invalid if, for example, some variable occurring twice
has at each occurrence a different length. So one gets at most $(m+2)^m$ patterns
which generate $x_1$. It follows that $(m+2)^m \ge 2^{n-1}$. This condition only holds
if $m \ge \frac{n-1}{2 \cdot \log(n+2)}$.




The VC-Dimension of Subclasses of Pattern Languages 






Furthermore, all lower bounds on $vcdws$ carry over from erasing pattern languages
to nonerasing pattern languages. For upper bounds, the following bounds
can be obtained by adapting the corresponding results for erasing pattern languages.
In these three cases, the upper bounds are only slightly larger than those
for the erasing pattern languages, but in the general case with an alphabet of size
2 or more, the above lower bound is $\frac{n-1}{2 \cdot \log(n+2)}$ for nonerasing pattern languages
while $4 + 2 \cdot \log(n)$ is an upper bound for the erasing pattern languages.

Theorem 8. For 1-variable pattern languages, $vcdws(n) \le p_1 \cdot p_2 \cdot \ldots \cdot p_n$, where
$p_1, p_2, \ldots, p_n$ are the first $n$ prime numbers.

For patterns with exactly two variable occurrences, $vcdws(n) \le (3n + 2) \cdot 2^n$.

If the alphabet size is 1, then vcdws{n) < ^ • (3n^ -I- 5n) for the class of all 
nonerasing pattern languages. 

Note that in the case the alphabet size is two, one can — using the well-known 
fact that nonerasing pattern languages shatter the set of the Xk = — 

obtain that vcdws{n) < n. 

5 Learning k-Variable Nonerasing Patterns

Having established several VC-dimension results in the previous section, we now 
present some PAC-learnability results. The fact that the VC dimension of k- 
variable pattern languages is infinite suggests that this class is not learnable. 
However, we will show that the version space becomes finite after the first positive 
example has been seen. 

Theorem 9. Let $\varepsilon$ and $\delta$ be given. Let $L$ be a $k$-variable pattern language and $D$
be an arbitrary distribution on $\Sigma^*$. Let $S$ be an initial set of positive sentences of
size at least one and let $l_{min} = \min\{|w| \mid w \in S\}$. Regarding the version space,
we can claim that $|VS(k\text{-variable pattern languages}, S)| \le (l_{min} + k)^{l_{min}}$. Let $h$
be any pattern consistent with a sample of size at most m > log ^ . 

Then $P(Err_{L,D}[h] > \varepsilon) < \delta$.
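
Once the version space is finite, a sample size can be computed from its cardinality. The sketch below uses the standard bound for a learner that outputs a hypothesis consistent with the sample over a finite hypothesis class, $m \ge \frac{1}{\varepsilon}(\ln|H| + \ln\frac{1}{\delta})$. This is an assumption made for illustration and is not necessarily the exact expression of the theorem; the function name `sample_size` and the example numbers are hypothetical.

```python
import math


def sample_size(num_hypotheses, eps, delta):
    """Number of examples after which any hypothesis consistent with the sample
    has error > eps with probability < delta (finite hypothesis class bound)."""
    return math.ceil((math.log(num_hypotheses) + math.log(1.0 / delta)) / eps)


if __name__ == "__main__":
    # Illustrative numbers only: a version space with at most 1000 patterns.
    print(sample_size(1000, eps=0.1, delta=0.05))
```

Because the bound depends on the logarithm of the version space size, an exponentially large version space still leads to a polynomial sample size.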

An exhaustive learner can find a consistent hypothesis (if one exists) after enumerating
all possible $(l_{min} + k)^{l_{min}}$ patterns. In order to decide whether a pattern
is consistent with a sample the learner has to check if $x \in L(h)$ for each example
$x$ and pattern $h$. While this problem is NP-complete for general patterns, it can
be solved in polynomial time for any fixed number of variables $k$.
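
The membership test $x \in L(h)$ can be sketched as a backtracking search over the substitutions of the variables. The code below is an illustrative sketch, not the authors' procedure. It assumes that a pattern is a list of symbols in which uppercase strings are variables and all other symbols are constants, and the name `matches` is hypothetical. With at most $k$ distinct variables the search tries at most $|x|^k$ substitutions.

```python
def matches(pattern, word, binding=None):
    """Return True if `word` belongs to the nonerasing language of `pattern`.

    Convention assumed for this sketch: uppercase symbols such as "X" are
    variables, every other symbol is a constant.
    """
    binding = {} if binding is None else binding
    if not pattern:
        return word == ""
    head, rest = pattern[0], pattern[1:]
    if not head.isupper():
        # Constant symbol: it must appear literally at the front of the word.
        return word.startswith(head) and matches(rest, word[len(head):], binding)
    if head in binding:
        # Variable seen before: it must repeat its earlier substitution.
        value = binding[head]
        return word.startswith(value) and matches(rest, word[len(value):], binding)
    for cut in range(1, len(word) + 1):
        # Nonerasing: a variable is replaced by at least one character.
        binding[head] = word[:cut]
        found = matches(rest, word[cut:], binding)
        del binding[head]
        if found:
            return True
    return False


if __name__ == "__main__":
    print(matches(["X", "a", "X"], "bab"))
    print(matches(["X", "a", "X"], "baab"))
```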

Theorem 10. Given a sample S of positive and negative strings, a consistent 
nonerasing k-variable pattern h can be found in 0{{lmin + kY'^''" ■ max{|a;| | x G 
S}^) - that is, learning is - as in the case of PAG - polynomial in parameters 
j- and i but depends exponentially on the parameters Imin and k. 

An algorithm that learns a $k$-variable pattern still has a run time which grows
exponentially in $l_{min}$. Under an additional assumption on $D$ and on the length
of substitution strings, $k$-variable patterns become efficiently learnable for fixed $k$.

Patterns with k variable occurrences. If we restrict the patterns to have at
most $k$ occurrences of any variables, they become even more easily learnable. The
number of $k$-occurrence patterns which are consistent with an initial example $x$
is at most as large as the number of $k$-variable patterns - that is, the logarithm
of the hypothesis space size is polynomial, which makes the required sample size
polynomial, too. However, the learner can find a consistent hypothesis much
more quickly.

Theorem 11. Pattern languages with up to k occurrences of variables can be 
learned from a sample in 0{l^~^ ■ k^) - that is, polynomially for fixed k. 

The idea of the proof is that up to k substrings in the shortest example can be 
substituted by variables. Hence, we only need to try all “start and end positions” 
of the variables and to enumerate all possible identifications of some variables. 

6 Length-Bounded Pattern Languages 

In this section we show that length-bounded pattern languages are efficiently 
learnable - even in the agnostic framework. Due to lack of space our treatment 
here is informal. 

We assume the alphabet size to be finite. There are $(|\Sigma| + k + 1)^k$ patterns
of length at most $k$ ($|\Sigma|$ constants, up to $k$ variables, and an empty symbol). It
follows immediately from Theorem 1 of jO] that P{ErvL ,D[h*] < ErrL,D[h]-e) < 
6 when h minimizes the empirical error, h* = infft,gjj{ifrri_£i[h]} is the truly best 

approximation of L in H, and the sample size is at least m > ^ log . 

In other words, by returning the hypothesis with the least empirical error a
learner returns (with high probability) a hypothesis which is almost as good
as the best possible hypothesis in the language. Hence, length-bounded pattern
languages are agnostically learnable. In order to find the hypothesis with the least
empirical error, a learner can enumerate the hypothesis space in $O((|\Sigma| + k + 1)^k)$.

The union of length-bounded patterns is the power set of the set of
length-bounded patterns - hence, the size of this hypothesis space can be bounded
by $2^{(|\Sigma| + k + 1)^k}$. This implies that the sample complexity is $m \ge$

(|i:|+fc-n) log (that is polynomial in 1^1, i, and |) but we still 

need to find an algorithm which finds a consistent hypothesis in polynomial 
time - together this proves that this class is polynomially learnable P]. A greedy 
coverage algorithm which subsequently generates patterns which cover at least 
one positive and no negative example can be guaranteed to find a consistent hy- 
pothesis (if one exists) in 0((| A| -|- A;)^ • to'*') where is the number of positive 
examples (that is, polynomially for a fixed k). 
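
A greedy coverage learner of this kind can be sketched as follows. This is an illustrative variant under stated assumptions, not the authors' algorithm: patterns are tuples of symbols of length at most $k$, variables are the hypothetical names `X0, X1, ...`, constants are lowercase characters, and in each round the sketch picks the pattern that covers the most uncovered positive examples while covering no negative example. The names `matches` and `greedy_cover` are hypothetical.

```python
from itertools import product


def matches(pattern, word, binding=None):
    """Nonerasing membership test; uppercase symbols are variables."""
    binding = {} if binding is None else binding
    if not pattern:
        return word == ""
    head, rest = pattern[0], pattern[1:]
    if not head.isupper() or head in binding:
        value = binding.get(head, head)
        return word.startswith(value) and matches(rest, word[len(value):], binding)
    for cut in range(1, len(word) + 1):
        binding[head] = word[:cut]
        found = matches(rest, word[cut:], binding)
        del binding[head]
        if found:
            return True
    return False


def greedy_cover(positives, negatives, alphabet, k):
    """Return a list of patterns whose union covers all positives and no
    negatives, or None if some positive example cannot be covered."""
    symbols = list(alphabet) + [f"X{i}" for i in range(k)]
    candidates = [p for n in range(1, k + 1) for p in product(symbols, repeat=n)]
    # Only patterns that reject every negative example are admissible.
    admissible = [p for p in candidates
                  if not any(matches(p, w) for w in negatives)]
    uncovered, chosen = set(positives), []
    while uncovered:
        best, best_cover = None, set()
        for p in admissible:
            cover = {w for w in uncovered if matches(p, w)}
            if len(cover) > len(best_cover):
                best, best_cover = p, cover
        if best is None:
            return None
        chosen.append(best)
        uncovered -= best_cover
    return chosen


if __name__ == "__main__":
    print(greedy_cover(["aa", "bb"], ["ab", "ba"], "ab", 2))
```

Enumerating all candidates costs on the order of $(|\Sigma| + k)^k$ pattern evaluations per round, so the sketch is only practical for small fixed $k$.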

In order to learn unions of length bounded pattern languages agnostically 
we would have to construct a polynomial algorithm which finds an empirical 
error minimizing hypothesis. Note that this is much more difficult: The greedy 
algorithm will find a consistent hypothesis - if one exists. It may occur that every 
positive instance is covered by a distinct pattern. In the agnostic framework,
we would have to find the hypothesis which minimizes the observed error. An
enumerative algorithm, however, would have a time complexity that is exponential
in the number of length-bounded patterns.

7 Conclusion 

We studied the VC-dimension of several subclasses of pattern languages. We 
showed that even single variable pattern languages have an unbounded VC- 
dimension. For this and several other classes with unbounded VC-dimension 
we furthermore quantified the VC-dimension witness size, thus characterizing 
just how quickly the VC-dimension grows. We showed that the VC-dimension 
of the class of single variable pattern languages which are consistent with a 
positive example is unbounded; by contrast, the class of pattern languages with 
k variable occurrences which are consistent with a positive example is finite. 
Hence, after the first positive example has been read, the sample size which is 
necessary for good generalization can be quantified. This result does seem to
vindicate recent attempts by Reischuk and Zeugmann (see also 0 ) to study 
feasible average case learnability of single variable pattern languages by placing 
reasonable restrictions on the class of distributions. 
Abstract. This paper presents a computation of the $V_\gamma$ dimension for
regression in bounded subspaces of Reproducing Kernel Hilbert Spaces
(RKHS) for the Support Vector Machine (SVM) regression $\epsilon$-insensitive
loss function $L_\epsilon$, and general $L_p$ loss functions. Finiteness of the $V_\gamma$
dimension is shown, which also proves uniform convergence in probability
for regression machines in RKHS subspaces that use the $L_\epsilon$ or general
$L_p$ loss functions. This paper presents a novel proof of this result. It also
presents a computation of an upper bound of the $V_\gamma$ dimension under
some conditions, that leads to an approach for the estimation of the
empirical $V_\gamma$ dimension given a set of training data.



1 Introduction 

The $V_\gamma$ dimension, a variation of the VC-dimension, is important for the
study of learning machines. In this paper we present a computation of the
$V_\gamma$ dimension of real-valued functions $L(y, f(x)) = |y - f(x)|^p$ and (Vapnik's
$\epsilon$-insensitive loss function) $L_\epsilon(y, f(x)) = |y - f(x)|_\epsilon$ with $f$ in a bounded
sphere in a Reproducing Kernel Hilbert Space (RKHS). We show that the $V_\gamma$
dimension is finite for these loss functions, and compute an upper bound on it. We
also present a second computation of the $V_\gamma$ dimension in a special case of infinite
dimensional RKHS, which is often the type of hypothesis spaces considered in
the literature (e.g. Radial Basis Functions). It also holds for the case when a
bias is added to the functions, that is with $f$ being of the form $f = f_0 + b$, where
$b \in R$ and $f_0$ is in a sphere in an infinite dimensional RKHS. This computation
leads to an approach for computing the empirical $V_\gamma$ dimension (or random
entropy of a hypothesis space) given a set of training data, an issue that
we discuss at the end of the paper. Our result applies to standard regression
learning machines such as Regularization Networks (RN) and Support Vector
Machines (SVM).

For a regression learning problem using $L$ as a loss function it is known
that finiteness of the $V_\gamma$ dimension for all $\gamma > 0$ is a necessary and sufficient
condition for uniform convergence in probability. So the results of this paper
have implications for uniform convergence both for RN and for SVM regression.



O. Watanabe, T. Yokomori (Eds.): ALT’99, LNAI 1720, pp. 106-^^3 1999- 
(c) Springer-Verlag Berlin Heidelberg 1999



On the $V_\gamma$ Dimension for Regression in Reproducing Kernel Hilbert Spaces






Previous related work addressed the problem of pattern recognition where
$L$ is an indicator function. The fat-shattering dimension was considered
instead of the $V_\gamma$ one. A different approach to proving uniform convergence for
RN and SVM is given in the literature, where covering number arguments using entropy
numbers of operators are presented. In both cases, regression as well as the case
of non-zero bias $b$ were marginally considered.

The paper is organized as follows. Section 2 outlines the background and
motivation of this work. The reader familiar with statistical learning theory and
RKHS can skip this section. Section 3 presents a proof of the results as well as
an upper bound to the $V_\gamma$ dimension. Section 4 presents a second computation
of the $V_\gamma$ dimension in a special case of infinite dimensional RKHS, also when
the hypothesis space consists of functions of the form $f = f_0 + b$ where $b \in R$
and $f_0$ is in a sphere in a RKHS. Finally, section 5 discusses possible extensions of
this work.



2 Background and Motivation 

We consider the problem of learning from examples as it is viewed in the framework
of statistical learning theory. We are given a set of $l$ examples
$\{(x_1, y_1), \ldots, (x_l, y_l)\}$ generated by randomly sampling from a space $X \times Y$ with
$X \subset R^d$, $Y \subset R$ according to an unknown probability distribution $P(x, y)$.
Throughout the paper we assume that $X$ and $Y$ are bounded. Using this set of
examples the problem of learning consists of finding a function $f : X \to Y$ that
can be used given any new point $x \in X$ to predict the corresponding value $y$.

The problem of learning from examples is known to be ill-posed. A
classical way to solve it is to perform Empirical Risk Minimization (ERM) with
respect to a certain loss function, while restricting the solution to the problem
to be in a “small” hypothesis space. Formally this means minimizing the
empirical risk $I_{emp}[f] = \frac{1}{l} \sum_{i=1}^{l} L(y_i, f(x_i))$ with $f \in H$, where $L$ is the loss
function measuring the error when we predict $f(x)$ while the actual value is $y$,
and $H$ is a given hypothesis space.
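
The empirical risk for the two families of loss functions studied in this paper can be sketched as follows. The $\epsilon$-insensitive loss is taken in its usual form $|y - f(x)|_\epsilon = \max(0, |y - f(x)| - \epsilon)$; the function names and the toy data are illustrative assumptions.

```python
def lp_loss(y, fx, p=1):
    """L_p loss |y - f(x)|^p."""
    return abs(y - fx) ** p


def eps_insensitive_loss(y, fx, eps=0.1):
    """Vapnik's epsilon-insensitive loss: errors below eps cost nothing."""
    return max(0.0, abs(y - fx) - eps)


def empirical_risk(f, examples, loss):
    """Average loss of f over a list of (x, y) examples."""
    return sum(loss(y, f(x)) for x, y in examples) / len(examples)


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 1.5), (2.0, 1.0)]   # toy (x, y) pairs
    identity = lambda x: x
    print(empirical_risk(identity, data, lp_loss))
    print(empirical_risk(identity, data, lambda y, fx: lp_loss(y, fx, p=2)))
    print(empirical_risk(identity, data, lambda y, fx: eps_insensitive_loss(y, fx, eps=0.5)))
```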

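In this paper, we consider hypothesis spaces of functions which are hyperplanes
in some feature space:

$$f(x) = \sum_{n=1}^{\infty} w_n \phi_n(x) \qquad (1)$$

with:

$$\sum_{n=1}^{\infty} \frac{w_n^2}{\lambda_n} < \infty \qquad (2)$$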



where $\phi_n(x)$ is a set of given, linearly independent basis functions, $\lambda_n$ are given
non-negative constants such that $\sum_{n=1}^{\infty} \lambda_n^2 < \infty$. Spaces of functions of the form

(1) can also be seen as Reproducing Kernel Hilbert Spaces (RKHS) with
kernel $K$ given by:

$$K(x, y) = \sum_{n=1}^{\infty} \lambda_n \phi_n(x) \phi_n(y). \qquad (3)$$

For any function $f$ as in (1), quantity (2) is called the RKHS norm of $f$, $\|f\|_K^2$,
while the number $D$ of features $\phi_n$ (which can be finite, in which case all sums
above are finite) is the dimensionality of the RKHS.
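
For a finite number of features, the kernel and the RKHS norm can be computed directly from their definitions. The sketch below assumes three hypothetical polynomial features and weights $\lambda_n$ chosen only for illustration; it also checks the reproducing property that the function $K(\cdot, y)$, whose coefficients are $w_n = \lambda_n \phi_n(y)$, has squared norm $K(y, y)$.

```python
def kernel(x, y, features, lambdas):
    """K(x, y) = sum_n lambda_n * phi_n(x) * phi_n(y)."""
    return sum(lam * phi(x) * phi(y) for phi, lam in zip(features, lambdas))


def rkhs_norm_sq(weights, lambdas):
    """Squared RKHS norm sum_n w_n^2 / lambda_n of f = sum_n w_n * phi_n."""
    return sum(w * w / lam for w, lam in zip(weights, lambdas))


def evaluate(x, weights, features):
    """f(x) = sum_n w_n * phi_n(x)."""
    return sum(w * phi(x) for w, phi in zip(weights, features))


if __name__ == "__main__":
    # Hypothetical feature map (1, x, x^2) with illustrative weights lambda_n.
    features = [lambda x: 1.0, lambda x: x, lambda x: x * x]
    lambdas = [1.0, 0.5, 0.25]
    print(kernel(2.0, 3.0, features, lambdas))
    # K(., 2) has coefficients w_n = lambda_n * phi_n(2); its norm is K(2, 2).
    w = [lam * phi(2.0) for phi, lam in zip(features, lambdas)]
    print(rkhs_norm_sq(w, lambdas), kernel(2.0, 2.0, features, lambdas))
```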

If we restrict the hypothesis space to consist of functions in a RKHS with
norm less than a constant $A$, the general setting of learning discussed above
becomes:

Minimize: $\frac{1}{l} \sum_{i=1}^{l} L(y_i, f(x_i))$
subject to: $\|f\|_K^2 \le A^2$. (4)

An important question for any learning machine of the type (4) is whether it
is consistent: as the number of examples $(x_i, y_i)$ goes to infinity the expected error
of the solution of the machine should converge in probability to the minimum
expected error in the hypothesis space. In the case of learning machines
performing ERM in a hypothesis space (4), consistency is shown to be related
with uniform convergence in probability, and necessary and sufficient conditions
for uniform convergence are given in terms of the $V_\gamma$ dimension (also known
as level fat shattering dimension) of the hypothesis space considered, which
is a measure of complexity of the space.

In statistical learning theory typically the measure of complexity used is
the VC-dimension. However, as we show below, the VC-dimension in the above
learning setting in the case of infinite dimensional RKHS is infinite both for $L_p$
and $L_\epsilon$, so it cannot be used to study learning machines of the form (4). Instead
one needs to consider other measures of complexity, such as the $V_\gamma$ dimension,
in order to prove uniform convergence in infinite dimensional RKHS. We now
present some background on the $V_\gamma$ dimension.

The $V_\gamma$ dimension of a set of real-valued functions is defined as follows:

Definition 1. Let $C \le L(y, f(x)) \le B$, $f \in H$, with $C$ and $B < \infty$. The $V_\gamma$-dimension
of $L$ in $H$ (of the set $\{L(y, f(x)), f \in H\}$) is defined as the maximum
number $h$ of vectors $(x_1, y_1), \ldots, (x_h, y_h)$ that can be separated into two classes
in all $2^h$ possible ways using rules:

class 1 if: $L(y_i, f(x_i)) \ge s + \gamma$
class -1 if: $L(y_i, f(x_i)) \le s - \gamma$

for $f \in H$ and some $C + \gamma \le s \le B - \gamma$. If for any number $N$, it is possible to
find $N$ points $(x_1, y_1), \ldots, (x_N, y_N)$ that can be separated in all the $2^N$ possible
ways, we will say that the $V_\gamma$-dimension of $L$ in $H$ is infinite.

For $\gamma = 0$ and for $s$ being free to change values for each separation of the
data, this becomes the VC dimension of the set of functions. In the case
of hyperplanes (1), the $V_\gamma$ dimension has also been referred to in the literature
as the VC dimension of hyperplanes with margin. In order to avoid confusion
with names, we call the VC dimension of hyperplanes with margin the
$V_\gamma$ dimension of hyperplanes (for appropriate $\gamma$ depending on the margin, as
discussed below).
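
The definition can be made concrete with a brute-force check on a finite family of functions. The sketch below is illustrative only: it assumes a finite list of candidate functions and a finite list of candidate thresholds $s$, and it uses one common threshold for all $2^h$ separations, as the definition requires. The names `separates` and `is_v_gamma_shattered` are hypothetical.

```python
from itertools import product


def separates(losses, labels, s, gamma):
    """True if the loss values realize `labels` with threshold s and margin gamma."""
    return all(
        (loss >= s + gamma) if label == 1 else (loss <= s - gamma)
        for loss, label in zip(losses, labels)
    )


def is_v_gamma_shattered(points, functions, loss, gamma, thresholds):
    """Check whether some single threshold s lets `functions` realize
    every labeling of `points` (a list of (x, y) pairs)."""
    loss_table = [[loss(y, f(x)) for x, y in points] for f in functions]
    for s in thresholds:
        if all(
            any(separates(row, labels, s, gamma) for row in loss_table)
            for labels in product((1, -1), repeat=len(points))
        ):
            return True
    return False


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 0.0)]
    lines = [lambda x, a=a, b=b: a + b * x
             for a in (0.0, 2.0) for b in (-2.0, 0.0, 2.0)]
    l1 = lambda y, fx: abs(y - fx)
    print(is_v_gamma_shattered(points, lines, l1, gamma=0.5, thresholds=[1.0]))
    print(is_v_gamma_shattered(points, lines, l1, gamma=1.5, thresholds=[1.0]))
```

Increasing $\gamma$ makes the rules harder to satisfy, which is why the same two points are shattered for a small margin but not for a large one.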

The $V_\gamma$ dimension can be used to bound the covering numbers of a set of
functions, which are in turn related to the generalization performance of
learning machines. Typically the fat-shattering dimension is used for this
purpose, but a close relation between that and the $V_\gamma$ dimension makes the
two equivalent for the purpose of bounding covering numbers and hence studying
the statistical properties of a machine. The VC dimension has been used to
bound the growth function $G^H(l)$. This function measures the maximum number
of ways we can separate $l$ points using functions from hypothesis space $H$. If $h$ is
the VC dimension, then $G^H(l)$ is $2^l$ if $l \le h$, and $\le (\frac{el}{h})^h$ otherwise (where $e$
is the standard natural logarithm constant). In section 3 we will use the growth
function of hyperplanes with margin to bound their VC dimension, which, as
discussed above, is their $V_\gamma$ dimension that we are interested in.
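
The bound on the growth function is easy to evaluate numerically. A minimal sketch, with the hypothetical name `growth_bound`:

```python
import math


def growth_bound(l, h):
    """Upper bound on the growth function for VC dimension h:
    2^l when l <= h, and (e*l/h)^h otherwise."""
    if l <= h:
        return 2.0 ** l
    return (math.e * l / h) ** h


if __name__ == "__main__":
    for l in (2, 5, 10, 50):
        print(l, growth_bound(l, h=5), 2.0 ** l)
```

For $l$ well above $h$ the bound grows only polynomially in $l$, whereas the number of all labelings $2^l$ grows exponentially.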

Using the $V_\gamma$ dimension Alon et al. gave necessary and sufficient conditions
for uniform convergence in probability to take place in a hypothesis space $H$. In
particular they proved the following important theorem:

Theorem 1. (Alon et al., 1997) Let $C \le L(y, f(x)) \le B$, $f \in H$, $H$ be a set
of bounded functions. The ERM method uniformly converges (in probability) if
and only if the $V_\gamma$ dimension of $L$ in $H$ is finite for every $\gamma > 0$.

It is clear that if for learning machines of the form (4) the $V_\gamma$ dimension of
the loss function $L$ in the hypothesis space defined is finite for $\forall \gamma > 0$, then
uniform convergence takes place. In the next section we present a proof of the
finiteness of the $V_\gamma$ dimension, as well as an upper bound on it.



2.1 Why Not Use the VC-Dimension 

Consider first the case of $L_p$ loss functions. Consider an infinite dimensional
RKHS, and the set of functions with norm $\|f\|_K^2 \le A^2$. If for any $N$ we can find
$N$ points that we can shatter using functions of our set according to the rule:

class 1 if: $|y - f(x)|^p \ge s$
class -1 if: $|y - f(x)|^p \le s$

then clearly the VC dimension is infinite. Consider $N$ distinct points $(x_i, y_i)$ with
$y_i = 0$ for all $i$, and let the smallest eigenvalue of matrix $G$ with $G_{ij} = K(x_i, x_j)$
be $\lambda$. Since we are in infinite dimensional RKHS, matrix $G$ is always invertible,
so $\lambda > 0$ since $G$ is positive definite and finite dimensional ($\lambda$ may decrease
as $N$ increases, but for any finite $N$ it is well defined and $\ne 0$).

For any separation of the points, we consider a function $f$ of the form $f(x) =
\sum_{i=1}^{N} \alpha_i K(x_i, x)$, which is a function of the form (1). We need to show that we
can find coefficients $\alpha_i$ such that the RKHS norm of the function is $\le A^2$. Notice
that the norm of a function of this form is $\alpha^T G \alpha$ where $(\alpha)_i = \alpha_i$ (throughout
the paper bold letters are used for noting vectors). Consider the set of linear
equations

$x_j \in$ class 1: $\sum_{i=1}^{N} \alpha_i G_{ij} = s^{1/p} + \eta$, $\eta > 0$
$x_j \in$ class -1: $\sum_{i=1}^{N} \alpha_i G_{ij} = s^{1/p} - \eta$, $\eta > 0$

Let $s = 0$. If we can find a solution $\alpha$ to this system of equations such that
$\alpha^T G \alpha \le A^2$ we can perform this separation, and since this is any separation we
can shatter the $N$ points. Notice that the solution to the system of equations is
$G^{-1}\eta$ where $\eta$ is the vector whose components are $(\eta)_i = \eta$ when $x_i$ is in class 1,
and $-\eta$ otherwise. So we need $(G^{-1}\eta)^T G (G^{-1}\eta) \le A^2 \Rightarrow \eta^T G^{-1} \eta \le A^2$. Since
the smallest eigenvalue of $G$ is $\lambda > 0$, we have that $\eta^T G^{-1} \eta \le \frac{\eta^T \eta}{\lambda}$. Moreover
$\eta^T \eta = N\eta^2$. So if we choose $\eta$ small enough such that $\frac{N\eta^2}{\lambda} \le A^2 \Rightarrow \eta^2 \le \frac{A^2 \lambda}{N}$,
the norm of the solution is less than $A^2$, which completes the proof.

For the case of the $L_\epsilon$ loss function the argument above can be repeated with
$y_i = \epsilon$ to prove again that the VC dimension is infinite in an infinite dimensional
RKHS.

Finally, notice that the same proof can be repeated for finite dimensional
RKHS to show that the VC dimension is never less than the dimensionality $D$ of
the RKHS, since it is possible to find $D$ points for which matrix $G$ is invertible
and repeat the proof above. As a consequence the VC dimension cannot be
controlled by $A^2$. This is also discussed in the literature.

3 An Upper Bound on the $V_\gamma$ Dimension

Below we always assume that data $x$ are within a sphere of radius $R$ in the
feature space defined by the kernel $K$ of the RKHS. Without loss of generality,
we also assume that $y$ is bounded between $-1$ and $1$. Under these assumptions
the following theorem holds:

Theorem 2. The $V_\gamma$ dimension $h$ for regression using $L_p$ ($1 \le p < \infty$) or $L_\epsilon$
loss functions for hypothesis spaces $H_A = \{f(x) = \sum_{n=1}^{D} w_n \phi_n(x) \mid \sum_{n=1}^{D} \frac{w_n^2}{\lambda_n} \le A^2\}$
and $y$ bounded, is finite for $\forall \gamma > 0$. If $D$ is the dimensionality of the RKHS,
then $h \le O(\min(D, \frac{(R^2+1)(A^2+1)}{\gamma^2}))$.

Proof. Let's consider first the case of the $L_1$ loss function. Let $B$ be the upper
bound on the loss function. From Definition 1 we can decompose the rules for
separating points as follows:

class 1 if $y_i - f(x_i) \ge s + \gamma$
or $y_i - f(x_i) \le -(s + \gamma)$
class -1 if $y_i - f(x_i) \le s - \gamma$
and $y_i - f(x_i) \ge -(s - \gamma)$ (5)

for some $\gamma \le s \le B - \gamma$. For any $N$ points, the number of separations of the
points we can get using rules (5) is not more than the number of separations we
can get using the product of two indicator functions with margin (of hyperplanes
with margin):

function (a): class 1 if $y_i - f_1(x_i) \ge s_1 + \gamma$
class -1 if $y_i - f_1(x_i) \le s_1 - \gamma$
function (b): class 1 if $y_i - f_2(x_i) \ge -(s_2 - \gamma)$
class -1 if $y_i - f_2(x_i) \le -(s_2 + \gamma)$ (6)

where $f_1$ and $f_2$ are in $H_A$, $\gamma \le s_1, s_2 \le B - \gamma$. This is shown as follows.

Clearly the product of the two indicator functions (6) has less “separating
power” when we add the constraints $s_1 = s_2 = s$ and $f_1 = f_2 = f$. Furthermore,
even with these constraints we still have more “separating power” than we have
using rules (5): any separation realized using (5) can also be realized using the
product of the two indicator functions (6) under the constraints $s_1 = s_2 = s$ and
$f_1 = f_2 = f$. For example, if $y - f(x) \ge s + \gamma$ then indicator function (a) will
give $+1$, indicator function (b) will give also $+1$, so their product will give $+1$,
which is what we get if we follow (5). Similarly for all other cases.

As mentioned in the previous section, for any $N$ points the number of ways
we can separate them is bounded by the growth function. Moreover, for products
of indicator functions it is known that the growth function is bounded by
the product of the growth functions of the indicator functions. Furthermore, the
indicator functions in (6) are hyperplanes with margin in the $D + 1$ dimensional
space of vectors $\{\phi_n(x), y\}$ where the radius of the data is $R^2 + 1$, the norm of
the hyperplane is bounded by $A^2 + 1$ (where in both cases we add 1 because of
$y$), and the margin is at least $\frac{\gamma^2}{A^2+1}$. The $V_\gamma$ dimension $h_\gamma$ of these hyperplanes
is known to be bounded by $h_\gamma \le \min((D + 1) + 1, \frac{(R^2+1)(A^2+1)}{\gamma^2})$. So the
growth function of the separating rules (5) is bounded by the product of the
growth functions $(\frac{el}{h_\gamma})^{h_\gamma}$, that is $G(l) \le (\frac{el}{h_\gamma})^{2h_\gamma}$ whenever $l \ge h_\gamma$. If $h_\gamma^{reg}$ is
the $V_\gamma$ dimension, then $h_\gamma^{reg}$ cannot be larger than the largest number $l$ for which
inequality $2^l \le (\frac{el}{h_\gamma})^{2h_\gamma}$ holds. From this, after some algebraic manipulations
(take the log of both sides) we get that $l \le 5h_\gamma$, therefore $h_\gamma^{reg} \le 5 \min(D +
2, \frac{(R^2+1)(A^2+1)}{\gamma^2})$, which proves the theorem for the case of $L_1$ loss functions.

For general $L_p$ loss functions we can follow the same proof where (5) now
needs to be rewritten as:

class 1 if $y_i - f(x_i) \ge (s + \gamma)^{1/p}$
or $f(x_i) - y_i \ge (s + \gamma)^{1/p}$
class -1 if $y_i - f(x_i) \le (s - \gamma)^{1/p}$
and $f(x_i) - y_i \le (s - \gamma)^{1/p}$ (7)

Moreover, for $1 < p < \infty$, $(s + \gamma)^{1/p} \ge s^{1/p} + \frac{\gamma}{pB}$ (since $\gamma = ((s + \gamma)^{1/p})^p - (s^{1/p})^p =
((s + \gamma)^{1/p} - s^{1/p})(((s + \gamma)^{1/p})^{p-1} + \ldots + (s^{1/p})^{p-1}) \le ((s + \gamma)^{1/p} - s^{1/p})(B + \ldots + B) =
((s + \gamma)^{1/p} - s^{1/p})(pB)$) and $(s - \gamma)^{1/p} \le s^{1/p} - \frac{\gamma}{pB}$ (similarly). Repeating the same
argument as above, we get that the $V_\gamma$ dimension is bounded by $5 \min(D +
2, \frac{(pB)^2 (R^2+1)(A^2+1)}{\gamma^2})$. Finally, for the $L_\epsilon$ loss function (5) can be rewritten as:

class 1 if $y_i - f(x_i) \ge s + \gamma + \epsilon$
or $f(x_i) - y_i \ge s + \gamma + \epsilon$
class -1 if $y_i - f(x_i) \le s - \gamma + \epsilon$
and $f(x_i) - y_i \le s - \gamma + \epsilon$ (8)

where calling $s' = s + \epsilon$ we can simply repeat the proof above and get the same
upper bound on the $V_\gamma$ dimension as in the case of the $L_1$ loss function. (Notice
that the constraint $\gamma \le s \le B - \gamma$ is not taken into account. Taking this into
account may slightly change the $V_\gamma$ dimension for $L_\epsilon$. Since it is a constraint, it
can only decrease - or not change - the $V_\gamma$ dimension).

These results imply that in the case of infinite dimensional RKHS the $V_\gamma$
dimension is still finite and is influenced only by $5 \frac{(R^2+1)(A^2+1)}{\gamma^2}$. In the next
section we present a different upper bound on the $V_\gamma$ dimension in a special case
of infinite dimensional RKHS.



4 The $V_\gamma$ Dimension in a Special Case



Below we assume that the data $x$ are restricted so that for any finite dimensional
matrix $G$ with entries $G_{ij} = K(x_i, x_j)$ (where $K$ is, as mentioned in the previous
section, the kernel of the RKHS considered, and $x_i \ne x_j$ for $i \ne j$) the largest
eigenvalue of $G$ is always $\le M^2$ for a given constant $M$. We consider only the
case that the RKHS is infinite dimensional. We denote by $B$ the upper bound of
$L(y, f(x))$. Under these assumptions we can show that:

Theorem 3. The $V_\gamma$ dimension for regression using $L_1$ loss function and for
hypothesis space $H_A = \{f(x) = \sum_{n=1}^{\infty} w_n \phi_n(x) + b \mid \sum_{n=1}^{\infty} \frac{w_n^2}{\lambda_n} \le A^2\}$ is finite
for $\forall \gamma > 0$. In particular:

1. If $b$ is constrained to be zero, then $V_\gamma \le \left[\frac{M^2 A^2}{\gamma^2}\right]$.
2. If $b$ is a free parameter, $V_\gamma \le 4 \left[\frac{M^2 A^2}{\gamma^2}\right]$.

Proof of part 1.

Suppose we can find $N > \left[\frac{M^2 A^2}{\gamma^2}\right]$ points $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ that we can
shatter. Let $s \in [\gamma, B - \gamma]$ be the value of the parameter used to shatter the
points.

Consider the following “separation”$^1$: if $|y_i| < s$, then $(x_i, y_i)$ belongs in class
1. All other points belong in class -1. For this separation we need:

$|y_i - f(x_i)| \ge s + \gamma$, if $|y_i| < s$
$|y_i - f(x_i)| \le s - \gamma$, if $|y_i| \ge s$ (9)

$^1$ Notice that this separation might be a “trivial” one in the sense that we may want
all the points to be $+1$ or all to be $-1$, i.e. when all $|y_i| < s$ or when all $|y_i| \ge s$
respectively.

This means that: for points in class 1 $f$ takes values either $y_i + s + \gamma + \delta_i$ or
$y_i - s - \gamma - \delta_i$, for $\delta_i \ge 0$. For points in the second class $f$ takes values either
$y_i + s - \gamma - \delta_i$ or $y_i - s + \gamma + \delta_i$, for $\delta_i \in [0, (s - \gamma)]$. So (9) can be seen as a
system of linear equations:

$$\sum_{n=1}^{\infty} w_n \phi_n(x_i) = t_i \qquad (10)$$

with $t_i$ being $y_i + s + \gamma + \delta_i$, or $y_i - s - \gamma - \delta_i$, or $y_i + s - \gamma - \delta_i$, or $y_i - s + \gamma + \delta_i$,
depending on $i$. We first use lemma 1 to show that for any solution (so $t_i$ are
fixed now) there is another solution with not larger norm that is of the form
$\sum_{i=1}^{N} \alpha_i K(x_i, x)$.

Lemma 1. Among all the solutions of a system of equations (10) the solution
with the minimum RKHS norm is of the form: $\sum_{i=1}^{N} \alpha_i K(x_i, x)$ with $\alpha = G^{-1} t$.

For a proof see the Appendix. Given this lemma, we consider only functions
of the form $\sum_{i=1}^{N} \alpha_i K(x_i, x)$. We show that the function of this form that solves
the system of equations (10) has norm larger than $A^2$. Therefore any other
solution has norm larger than $A^2$, which implies we cannot shatter $N$ points
using functions of our hypothesis space.

The solution $\alpha = G^{-1} t$ needs to satisfy the constraint:

$$\alpha^T G \alpha = t^T G^{-1} t \le A^2$$

Let $\lambda_{max}$ be the largest eigenvalue of matrix $G$. Then $t^T G^{-1} t \ge \frac{t^T t}{\lambda_{max}}$. Since
$\lambda_{max} \le M^2$, $t^T G^{-1} t \ge \frac{t^T t}{M^2}$. Moreover, because of the choice of the separation,
$t^T t \ge N\gamma^2$ (for example, for the points in class 1 which contribute to $t^T t$ an
amount equal to $(y_i + s + \gamma + \delta_i)^2$: $|y_i| < s \Rightarrow y_i + s > 0$, and since $\gamma + \delta_i \ge \gamma > 0$,
then $(y_i + s + \gamma + \delta_i)^2 \ge \gamma^2$. Similarly each of the other points “contributes” to
$t^T t$ at least $\gamma^2$, so $t^T t \ge N\gamma^2$). So:

$$t^T G^{-1} t \ge \frac{N\gamma^2}{M^2} > A^2$$

since we assumed that $N > \frac{M^2 A^2}{\gamma^2}$. This is a contradiction, so we conclude that
we cannot get this particular separation.



Proof of part 2.

Consider $N$ points that can be shattered. This means that for any separation,
for points in the first class there are $\delta_i \ge 0$ such that $|f(x_i) + b - y_i| = s + \gamma + \delta_i$.
For points in the second class there are $\delta_i \in [0, s - \gamma]$ such that $|f(x_i) + b - y_i| =
s - \gamma - \delta_i$. As in the case $b = 0$ we can remove the absolute values by considering
for each class two types of points (we call them type 1 and type 2). For class
1, type 1 are points for which $f(x_i) = y_i + s + \gamma + \delta_i - b = t_i - b$. Type 2
are points for which $f(x_i) = y_i - s - \gamma - \delta_i - b = t_i - b$. For class 2, type 1
are points for which $f(x_i) = y_i + s - \gamma - \delta_i - b = t_i - b$. Type 2 are points
for which $f(x_i) = y_i - s + \gamma + \delta_i - b = t_i - b$. Variables $t_i$ are as in the case
$b = 0$. Let $S_{11}, S_{12}, S_{-11}, S_{-12}$ denote the four sets of points ($S_{ij}$ are points of
class $i$ type $j$). Using lemma 1, we only need to consider functions of the form
$f(x) = \sum_{i=1}^{N} \alpha_i K(x_i, x)$. The coefficients $\alpha_i$ are given by $\alpha = G^{-1}(t - b)$, where
$b$ is a vector of $b$'s. As in the case $b = 0$, the RKHS norm of this function is at
least

$$\frac{1}{M^2} \sum_{i=1}^{N} (t_i - b)^2. \qquad (11)$$

The $b$ that minimizes (11) is $\frac{1}{N} \sum_{i=1}^{N} t_i$. So (11) is at least as large as (after
replacing $b$ and doing some simple calculations) $\frac{1}{2NM^2} \sum_{i,j=1}^{N} (t_i - t_j)^2$.

We now consider a particular separation. Without loss of generality assume
that $y_1 \le y_2 \le \ldots \le y_N$ and that $N$ is even (if odd, consider $N - 1$ points).
Consider the separation where class 1 consists only of the “even” points $\{N, N -
2, \ldots, 2\}$. The following lemma is shown in the appendix:

Lemma 2. For the separation considered, $\sum_{i,j=1}^{N} (t_i - t_j)^2$ is at least as large
as $\frac{\gamma^2 (N^2 - 4)}{2}$.

Using Lemma 2 we get that the norm of the solution for the considered separation
is at least as large as $\frac{\gamma^2 (N^2 - 4)}{4NM^2}$. Since this has to be $\le A^2$ we get that
$N - \frac{4}{N} \le 4 \left[\frac{M^2 A^2}{\gamma^2}\right]$, which completes the proof (assume $N > 4$ and ignore additive
constants less than 1 for simplicity of notation).



In the case of $L_p$ loss functions, using the same argument as in the previous
section we get that the $V_\gamma$ dimension in infinite dimensional RKHS is bounded
by $\frac{(pB)^2 M^2 A^2}{\gamma^2}$ in the first case of theorem 3, and by $4 \frac{(pB)^2 M^2 A^2}{\gamma^2}$ in the second
case of theorem 3. Finally, for $L_\epsilon$ loss functions the bound on the $V_\gamma$ dimension is
the same as that for the $L_1$ loss function, again using the argument of the previous
section.



4.1 Empirical $V_\gamma$ Dimension

Above we assumed a bound on the eigenvalues of any finite dimensional matrix
$G$. However, such a bound may not be known a priori, or it may not even exist,
in which case the computation is not valid. In practice we can still use the
method presented above to measure the empirical $V_\gamma$ dimension given a set of $\ell$
training points. This can provide an upper bound on the random entropy of our
hypothesis space.

More precisely, given a set of $\ell$ training points we build the $\ell \times \ell$ matrix $G$
as before, and compute its largest eigenvalue $\lambda_{\max}$. We can then substitute
the assumed eigenvalue bound with $\lambda_{\max}$ in the computation above to get an upper
bound of what we call the empirical $V_\gamma$ dimension. This can be used directly to
get bounds on the random entropy (or number of ways that the $\ell$ training points
can be separated using rules (…)) of our hypothesis space. Finally, the statistical
properties of our learning machine can be studied using the estimated empirical
$V_\gamma$ dimension (or the random entropy), in a way similar in spirit to […].
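
As a minimal sketch of the first step of this procedure, the following builds the matrix $G$ for a set of training points and estimates its largest eigenvalue by power iteration. The Gaussian kernel, its `width` parameter, and all function names are assumptions made for illustration; the text does not fix a kernel, and any symmetric positive semi-definite kernel could be substituted.

```python
import math

def gaussian_kernel(x, z, width=1.0):
    # Illustrative kernel choice; the text does not fix a particular K.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def gram_matrix(points, kernel):
    # G[i][j] = K(x_i, x_j) for the given training points.
    return [[kernel(x, z) for z in points] for x in points]

def largest_eigenvalue(G, iterations=500):
    # Power iteration. Assumes G is symmetric positive semi-definite and
    # that the uniform start vector is not orthogonal to the top eigenvector.
    n = len(G)
    v = [1.0 / math.sqrt(n)] * n
    lam = 0.0
    for _ in range(iterations):
        w = [sum(G[r][c] * v[c] for c in range(n)) for r in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        lam = sum(v[r] * w[r] for r in range(n))  # Rayleigh quotient v^T G v
        v = [x / norm for x in w]
    return lam
```

The returned value plays the role of $\lambda_{\max}$ in the paragraph above; how it enters the final bound depends on the computation of the preceding section, which this sketch does not reproduce.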

5 Conclusion 

We presented a novel approach for computing the $V_\gamma$ dimension of RKHS for
$L_p$ and $L_\epsilon$ loss functions. We conclude with a few remarks. First notice that in
the computations we did not take into account $\epsilon$ in the case of the $L_\epsilon$ loss function.
Taking $\epsilon$ into account may lead to better bounds. For example, considering
$|f(x) - y|^p_\epsilon$, $p > 1$, as the loss function, it is clear from the proofs presented that
the $V_\gamma$ dimension is bounded by … . However, the influence of $\epsilon$ seems
to be minor (given that $\epsilon \ll B$).

An interesting observation is that the eigenvalues of the matrix $G$ appear in
the computation of the $V_\gamma$ dimension. In the second computation we took into
account only the largest and smallest eigenvalues. If the computation is made
to upper bound the number of separations for a given set of points (random
entropy or empirical $V_\gamma$ dimension) as discussed in section 4.1, then it may be
possible that all the eigenvalues of $G$ are taken into account. This can lead to
interesting relations with the work in […].

Appendix

Proof of Lemma 1

We introduce the $N \times \infty$ matrix $A_{in} = …$ and the new variable $z_n = …$.
We can write system (…) as follows:

$Az = t.$ (12)

Notice that the solution of the system of equations (…) with minimum RKHS
norm is equivalent to the Least Squares (LS) solution of equation (12). Let us
denote with $z^0$ the LS solution of system (12). We have:

$z^0 = (A^\top A)^{+} A^\top t$ (13)

where $+$ denotes pseudoinverse. To see what this solution looks like we use
Singular Value Decomposition techniques:

$A = U \Sigma V^\top,$

$A^\top = V \Sigma U^\top,$

from which $A^\top A = V \Sigma^2 V^\top$ and $(A^\top A)^{+} = …$, where … denotes the
$N \times N$ matrix whose elements are the inverse of the nonzero eigenvalues. After
some computations equation (13) can be written as:

$z^0 = … = (V \Sigma U^\top)(U \Sigma^{-2} U^\top)\, t = A^\top G^{-1} t.$ (14)

Using the definition of z° we have that 

OO OO iV 

n—1 n—1 i—1 

Finally, using the definition of Ai^ we get: 

oo N 

n—1 i—1 



which completes the proof.

Proof of Lemma 2

Consider a point $(x_i, y_i)$ in $S_{11}$ and a point $(x_j, y_j)$ in $S_{-11}$ such that $y_i \ge y_j$
(if such a pair does not exist we can consider another pair from the cases listed
below). For these points
$(t_i - t_j)^2 = (y_i + s + \gamma + \delta_i - y_j - s + \gamma + \delta_j)^2 = ((y_i - y_j) + 2\gamma + \delta_i + \delta_j)^2 \ge 4\gamma^2$.
In a similar way (taking into account the constraints on the $\delta_i$'s and on $s$) the
inequality $(t_i - t_j)^2 \ge 4\gamma^2$ can be shown to hold in the following two cases:

$(x_i, y_i) \in S_{11}$, $(x_j, y_j) \in S_{-11} \cup S_{-12}$, $y_i \ge y_j$
$(x_i, y_i) \in S_{12}$, $(x_j, y_j) \in S_{-11} \cup S_{-12}$, $y_i \le y_j$



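Moreover

$\sum_{i,j=1}^{N} (t_i - t_j)^2 \ge 2 \sum_{i \in S_{11}} \Bigl( \sum_{j \in S_{-11} \cup S_{-12},\ y_i \ge y_j} (t_i - t_j)^2 \Bigr) + 2 \sum_{i \in S_{12}} \Bigl( \sum_{j \in S_{-11} \cup S_{-12},\ y_i \le y_j} (t_i - t_j)^2 \Bigr)$ (16)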



since in the right hand side we excluded some of the terms of the left hand side.
Using the fact that for the cases considered $(t_i - t_j)^2 \ge 4\gamma^2$, the right hand side
is at least

$8\gamma^2 \sum_{i \in S_{11}} (\text{number of points } j \text{ in class } -1 \text{ with } y_i \ge y_j) + 8\gamma^2 \sum_{i \in S_{12}} (\text{number of points } j \text{ in class } -1 \text{ with } y_i \le y_j).$ (17)

Let $I_1$ and $I_2$ be the cardinalities of $S_{11}$ and $S_{12}$ respectively. Because of the
choice of the separation it is clear that (17) is at least

$8\gamma^2 \bigl( (1 + 2 + \dots + I_1) + (1 + 2 + \dots + (I_2 - 1)) \bigr)$
(for example if $I_1 = 2$ in the worst case points 2 and 4 are in $S_{11}$ in which case
the first part of (17) is exactly $1 + 2$). Finally, since $I_1 + I_2 = N/2$, (17) is at least
$8\gamma^2$ … , which proves the lemma.


Abstract. This paper provides a systematic study of incremental learning
from noise-free and from noisy data, thereby distinguishing between
learning from only positive data and from both positive and negative
data. Our study relies on the notion of noisy data introduced in [22].
The basic scenario, named iterative learning, is as follows. In every
learning stage, an algorithmic learner takes as input one element of an
information sequence for a target concept and its previously made
hypothesis and outputs a new hypothesis. The sequence of hypotheses has
to converge to a hypothesis describing the target concept correctly.

We study the following refinements of this scenario. Bounded example- 
memory inference generalizes iterative inference by allowing an iterative 
learner to additionally store an a priori bounded number of carefully 
chosen data elements, while feedback learning generalizes it by allowing 
the iterative learner to additionally ask whether or not a particular data 
element did already appear in the data seen so far. 

For the case of learning from noise-free data, we show that, where both
positive and negative data are available, restrictions on the accessibility
of the input data do not limit the learning capabilities if and only if
the relevant iterative learners are allowed to query the history of the
learning process or to store at least one carefully selected data element.
This insight nicely contrasts with the fact that, in case only positive data
are available, restrictions on the accessibility of the input data seriously
affect the capabilities of all types of incremental learning (cf. […]).

For the case of learning from noisy data, we present characterizations
of all kinds of incremental learning in terms that are independent of
learning theory. The relevant conditions are purely structural ones.
Surprisingly, when learning from only noisy positive data or from both
noisy positive and negative data, iterative learners are already exactly
as powerful as unconstrained learning devices.



1 Introduction 

The theoretical investigations in the present paper derive their motivation to 
a certain extent from the rapidly developing field of knowledge discovery in
databases (abbr. KDD). KDD mainly combines techniques originating from machine
learning, knowledge acquisition and knowledge representation, artificial
intelligence, pattern recognition, statistics, data visualization, and databases to
automatically extract new interrelations, knowledge, patterns and the like from
huge collections of data (cf. […] for a recent overview).

Among the different parts of the KDD process, like data presentation, data 
selection, incorporating prior knowledge, and defining the semantics of the re- 
sults obtained, we are mainly interested in the particular subprocess of applying 
specific algorithms for learning something useful from the data. This subprocess 
is usually named data mining. There is one problem when invoking machine 
learning techniques to do data mining. Almost all machine learning algorithms 
are “in-memory” algorithms, i.e., they require the whole data set to be present 
in the main memory when extracting the concepts hidden in the data. However, 
if huge data sets are around, no learning algorithm can use all the data or even 
large portions of it simultaneously for computing hypotheses. Different methods 
have been proposed for overcoming the difficulties caused by huge data sets. 
For example, instead of doing the discovery process on all the data, one starts 
with significantly smaller samples, finds the regularities in it, and uses different 
portions of the overall data to verify what one has found. 

Looking at data mining from this perspective, it becomes a true limiting pro- 
cess. That means, the actual hypothesis generated by the data mining algorithm 
is tested versus parts of the remaining data. Then, if the current hypothesis is 
not acceptable, the sample may be enlarged or replaced and the data mining al- 
gorithm will be restarted. Thus, from a theoretical point of view, it is appropriate 
to look at the data mining process as an ongoing, incremental one. 

For the purpose of motivation and discussion of our research, we next introduce
some basic notions. By $X$ we denote any learning domain. Any collection $C$
of sets $c \subseteq X$ is called a concept class. Moreover, $c$ is referred to as a concept. An
algorithmic learner, henceforth called inductive inference machine (abbr. IIM),
takes as input initial segments of an information sequence and outputs, once in
a while, a hypothesis about the target concept. The set $\mathcal{H}$ of all admissible
hypotheses is called hypothesis space. The sequence of hypotheses has to converge
to a hypothesis describing the target concept correctly. If there is an IIM that
learns a concept $c$ from all admissible information sequences for it, then $c$ is said
to be learnable in the limit with respect to $\mathcal{H}$ (cf. [10]).

Gold’s [10] model of learning in the limit relies on the unrealistic assumption
that the learner has access to samples of growing size. Therefore, we investigate
variations of the general approach that restrict the accessibility of the input data
considerably. We deal with iterative learning, $k$-bounded example-memory inference,
and feedback identification of indexable concept classes. All these models
formalize incremental learning, a topic attracting more and more attention in
the machine learning community (cf., e.g., […]).

An iterative learner is required to produce its actual guess exclusively from 
its previous one and the next element in the information sequence presented. 
Iterative learning has been introduced in […] and has further been studied by
various authors (cf., e.g., […]). Alternatively, we consider learners
that are allowed to store up to $k$ carefully chosen data elements seen so
far, where $k$ is a priori fixed ($k$-bounded example-memory inference). Bounded
example-memory learning has its origins in [18]. Furthermore, we study feedback
identification. The idea of feedback learning goes back to […]. In this setting,
the iterative learner is additionally allowed to ask whether or not a particular
data element has already appeared in the data seen so far.



In the first part of the present paper, we investigate incremental learning 
from noise-free data. As usual, we distinguish the case of learning from only 
positive data and learning from both positive and negative data, synonymously 
learning from text and informant, respectively. A text for a concept c is an infinite 
sequence that eventually contains all and only the elements of c. Alternatively, 
an informant for c is an infinite sequence of all elements of X that are classified 
according to their membership in c. 

Former theoretical studies mostly dealt with incremental concept learning 
from only positive data (cf. […]). It has been proved that (i) all defined
models of incremental learning are strictly less powerful than conservative infer- 
ence (which itself is strictly less powerful than learning in the limit), (ii) feedback 
learning and bounded example-memory inference outperform iterative learning, 
and (iii) feedback learning and bounded example-memory inference extend the 
learning capabilities of iterative learners in different directions. In particular, it 
has been shown that any additional data element an iterative learner may store 
buys more learning power. 

As we shall show, the situation changes considerably in case positive and 
negative data are available. Now, it is sufficient to store one carefully selected 
data element in the example-memory in order to achieve the whole learning 
power of unconstrained learning machines. As a kind of side-effect, the infinite 
hierarchy of more and more powerful bounded example-memory learners which 
has been observed in the text case collapses. Furthermore, also feedback learners 
are exactly as powerful as unconstrained learning devices. In contrast, similarly 
to the case of learning from positive data, the learning capabilities of iterative 
learners are again seriously affected. 



In the second part of the present paper, we study incremental learning from 
noisy data. This topic is of interest, since, in real-world applications, one rarely
receives perfect data. There are a lot of attempts to give a precise notion of
what the term noisy data means (cf., e.g., […]). In our study, we adopt
the notion from [22] (see also [23]), which seems to have become standard when
studying Gold-style learning (cf. […]). This notion has the advantage that
noisy data about a target concept nonetheless uniquely specify that concept.
Roughly speaking, correct data elements occur infinitely often whereas incorrect
data elements occur only finitely often. Generally, the model of noisy environments
introduced in [22] aims to grasp situations in which, due to better simulation
techniques or better technical equipment, the experimental data which a
learner receives about an unknown phenomenon become better and better over
time until they reflect the reality sufficiently well.




On the Strength of Incremental Learning 121 



Surprisingly, where learning from noisy data is considered, iterative learners
are exactly as powerful as unconstrained learning machines, and thus iterative
learners are able to fully compensate for the limitations in the accessibility of the
input data. This nicely contrasts with the fact that, when learning from noise-free
text and noise-free informant, iterative learning is strictly less powerful than
learning in the limit. Moreover, it immediately implies that all different models
of incremental learning introduced above coincide. Furthermore, we characterize
iterative learning from noisy data in terms that are independent of learning
theory. We show that an indexable class can be iteratively learned from noisy
text if and only if it is inclusion-free. Alternatively, it is iteratively learnable
from noisy informant if and only if it is discrete.



2 Preliminaries 

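Let $\mathbb{N} = \{0, 1, 2, \ldots\}$ be the set of all natural numbers. By $\langle \cdot, \cdot \rangle \colon \mathbb{N} \times \mathbb{N} \to \mathbb{N}$
we denote Cantor’s pairing function. We write $A \# B$ to indicate that two sets
$A$ and $B$ are incomparable, i.e., $A \setminus B \ne \emptyset$ and $B \setminus A \ne \emptyset$.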
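
For readers who want the pairing function concretely, here is a small sketch of one common form of Cantor's pairing function together with its inverse. The paper does not say which variant it uses, so the exact formula and the function names are assumptions.

```python
import math

def cantor_pair(x, y):
    # One standard variant: enumerate pairs along anti-diagonals x + y = const.
    return (x + y) * (x + y + 1) // 2 + y

def cantor_unpair(z):
    # Recover the anti-diagonal w = x + y first, then the offset y on it.
    w = (math.isqrt(8 * z + 1) - 1) // 2
    y = z - w * (w + 1) // 2
    return w - y, y
```

The pairing is a bijection between pairs of natural numbers and natural numbers, which is what allows a single number to encode two.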

Any recursively enumerable set $X$ is called a learning domain. By $\wp(X)$ we
denote the power set of $X$. Let $C \subseteq \wp(X)$ and let $c \in C$. We refer to $C$ and $c$ as
a concept class and a concept, respectively. Sometimes, we will identify a concept $c$
with its characteristic function, i.e., we let $c(x) = +$, if $x \in c$, and $c(x) = -$, otherwise.

We deal with the learnability of indexable concept classes with uniformly
decidable membership defined as follows (cf. […]). A class of non-empty concepts $C$
is said to be an indexable concept class with uniformly decidable membership if
there are an effective enumeration $(c_j)_{j \in \mathbb{N}}$ of all and only the concepts in $C$ and a
recursive function $f$ such that, for all $j \in \mathbb{N}$ and all $x \in X$, it holds $f(j, x) = +$,
if $x \in c_j$, and $f(j, x) = -$, otherwise. We refer to indexable concept classes with
uniformly decidable membership as indexable classes, for short.

Next, we describe some well-known examples of indexable classes. First, let
$\Sigma$ denote any fixed finite alphabet of symbols and let $\Sigma^*$ be the free monoid
over $\Sigma$. Moreover, let $X = \Sigma^*$ be the learning domain. We refer to subsets
$L \subseteq X$ as languages (instead of concepts). Then, the set of all context-sensitive
languages, context-free languages, regular languages, and of all pattern
languages form indexable classes (cf. […]). Second, let $X_n = \{0, 1\}^n$ be the
set of all $n$-bit Boolean vectors. We consider $X = \bigcup_{n \ge 1} X_n$ as learning domain.
Then, the set of all concepts expressible as a monomial, a $k$-CNF, a $k$-DNF, and
a $k$-decision list constitute indexable classes (cf. […]).

Finally, we define some useful properties of indexable classes. Let $X$ be the
underlying learning domain and let $C$ be an indexable class. Then, $C$ is said to
be inclusion-free iff $c \# c'$ for all distinct concepts $c, c' \in C$. Let $(w_j)_{j \in \mathbb{N}}$ be
the lexicographically ordered enumeration of all elements in $X$. For all $c \subseteq X$,
by $i^c$ we denote the lexicographically ordered informant of $c$, i.e., the infinite
sequence $((w_j, c(w_j)))_{j \in \mathbb{N}}$. Then, $C$ is said to be discrete iff, for every $c \in C$,
there is an initial segment of $c$'s lexicographically ordered informant $i^c$, say $i^c_x$,
that separates $c$ from all other concepts $c' \in C$. More precisely speaking, for all
$c' \in C$, if $c \ne c'$ then $i^c_x \ne i^{c'}_x$.
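
To make the inclusion-free property concrete, the following sketch checks it for a finite family of finite sets. Real indexable classes are usually infinite, so this is only an illustration on toy data; the function names are hypothetical.

```python
from itertools import combinations

def incomparable(a, b):
    # A # B: neither set contains the other.
    return bool(a - b) and bool(b - a)

def is_inclusion_free(concepts):
    # Compare only distinct concepts, as in the definition.
    distinct = {frozenset(c) for c in concepts}
    return all(incomparable(a, b) for a, b in combinations(distinct, 2))
```

For example, `[{1, 2}, {2, 3}, {1, 3}]` is inclusion-free, whereas any family containing both `{1}` and `{1, 2}` is not.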

3 Formalizing Incremental Learning 

3.1 Learning from Noise-Free Data 

Let $X$ be the underlying learning domain, let $c \subseteq X$ be a concept, and let
$t = (x_n)_{n \in \mathbb{N}}$ be an infinite sequence of elements from $c$ such that $\{x_n \mid n \in \mathbb{N}\} = c$.
Then, $t$ is said to be a text for $c$. By $Text(c)$ we denote the set of all texts for $c$.
Alternatively, let $i = ((x_n, b_n))_{n \in \mathbb{N}}$ be an infinite sequence of elements from
$X \times \{+, -\}$ such that $\{x_n \mid n \in \mathbb{N}\} = X$, $\{x_n \mid n \in \mathbb{N}, b_n = +\} = c$, and
$\{x_n \mid n \in \mathbb{N}, b_n = -\} = \text{co-}c = X \setminus c$. Then, we refer to $i$ as an informant
for $c$. By $Info(c)$ we denote the set of all informants for $c$. Moreover, let $t$ be a
text, let $i$ be an informant, and let $y$ be a number. Then, $t_y$ and $i_y$ denote the
initial segment of $t$ and $i$ of length $y + 1$. Furthermore, we set
$t_y^+ = \{x_n \mid n \le y\}$, $i_y^+ = \{x_n \mid n \le y, b_n = +\}$, and $i_y^- = \{x_n \mid n \le y, b_n = -\}$.
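
The content sets of an initial segment can be computed directly from a finite prefix. The sketch below assumes a text is given as a Python list of elements and an informant as a list of `(element, label)` pairs with labels `"+"` and `"-"`; these encodings and the function names are illustrative choices.

```python
def text_content(t, y):
    # Elements occurring in the initial segment of length y + 1 of a text.
    return set(t[: y + 1])

def informant_content(i, y):
    # Positive and negative elements in the initial segment of length y + 1.
    segment = i[: y + 1]
    positive = {x for x, b in segment if b == "+"}
    negative = {x for x, b in segment if b == "-"}
    return positive, negative
```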

As in […], we define an inductive inference machine (abbr. IIM) to be an
algorithmic mapping from initial segments of texts (informants) to $\mathbb{N} \cup \{?\}$. Thus,
an IIM either outputs a hypothesis, i.e., a number encoding a certain computer
program, or it outputs “?,” a special symbol representing the case that the machine
outputs “no conjecture.” Note that an IIM, when learning some target class
$C$, is required to produce an output when processing any initial segment of any
admissible information sequence, i.e., any initial segment of any text (informant)
for any $c \in C$.

The numbers output by an IIM are interpreted with respect to a suitably
chosen hypothesis space $\mathcal{H} = (h_j)_{j \in \mathbb{N}}$. Since we exclusively deal with indexable
classes $C$, we always assume that $\mathcal{H}$ is also an indexing of some possibly larger
class of non-empty concepts. Hence, membership is uniformly decidable in $\mathcal{H}$,
too. Formally speaking, we deal with class comprising learning (cf. [27]). When
an IIM outputs some number $j$, we interpret it to mean that it hypothesizes $h_j$.

In all that follows, a data sequence $\sigma = (d_n)_{n \in \mathbb{N}}$ for a target concept $c$ is
either a text $t = (x_n)_{n \in \mathbb{N}}$ or an informant $i = ((x_n, b_n))_{n \in \mathbb{N}}$ for $c$. By convention,
for all $y \in \mathbb{N}$, $\sigma_y$ denotes the initial segment $t_y$ or $i_y$.

We define convergence of IIMs as usual. Let $\sigma$ be given and let $M$ be an IIM.
The sequence $(M(\sigma_y))_{y \in \mathbb{N}}$ of $M$'s hypotheses converges to a number $j$ iff all but
finitely many terms of it are equal to $j$.

Now, we are ready to define learning in the limit. 

Definition 1 ([10]). Let $C$ be an indexable class, let $c$ be a concept, and let
$\mathcal{H} = (h_j)_{j \in \mathbb{N}}$ be a hypothesis space. An IIM $M$ $LimTxt_{\mathcal{H}}$ [$LimInf_{\mathcal{H}}$]-identifies $c$
iff, for every data sequence $\sigma$ with $\sigma \in Text(c)$ [$\sigma \in Info(c)$], there is a $j \in \mathbb{N}$
with $h_j = c$ such that the sequence $(M(\sigma_y))_{y \in \mathbb{N}}$ converges to $j$.

Then, $M$ $LimTxt_{\mathcal{H}}$ [$LimInf_{\mathcal{H}}$]-identifies $C$ iff, for all $c' \in C$, $M$ $LimTxt_{\mathcal{H}}$
[$LimInf_{\mathcal{H}}$]-identifies $c'$.

Finally, $LimTxt$ [$LimInf$] denotes the collection of all indexable classes $C'$
for which there are a hypothesis space $\mathcal{H}' = (h'_j)_{j \in \mathbb{N}}$ and an IIM $M$ such that
$M$ $LimTxt_{\mathcal{H}'}$ [$LimInf_{\mathcal{H}'}$]-identifies $C'$.

In the above definition, Lim stands for “limit” . Suppose an IIM identifies 
some concept c. That means, after having seen only finitely many data of c the 
IIM reaches its (unknown) point of convergence and it computes a correct and 
finite description of the target concept. Hence, some form of learning must have 
taken place. 

In general, it is not decidable whether or not an IIM has already converged
on a text $t$ (an informant $i$) for the target concept $c$. Adding this requirement
to the above definition results in finite learning (cf. [10]). The corresponding
learning types are denoted by $FinTxt$ and $FinInf$.

Next, we define conservative IIMs. Intuitively speaking, conservative IIMs 
maintain their actual hypothesis at least as long as they have not seen data 
contradicting it. 

Definition 2 ([…]). Let $C$ be an indexable class, let $c$ be a concept, and let
$\mathcal{H} = (h_j)_{j \in \mathbb{N}}$ be a hypothesis space. An IIM $M$ $ConsvTxt_{\mathcal{H}}$ [$ConsvInf_{\mathcal{H}}$]-identifies
$c$ iff $M$ $LimTxt_{\mathcal{H}}$ [$LimInf_{\mathcal{H}}$]-identifies $c$ and, for every data sequence $\sigma$ with
$\sigma \in Text(c)$ [$\sigma \in Info(c)$] and for any two consecutive hypotheses $k = M(\sigma_y)$
and $j = M(\sigma_{y+1})$, if $k \in \mathbb{N}$ and $k \ne j$, then $h_k$ is not consistent with $\sigma_{y+1}$.¹

$M$ $ConsvTxt_{\mathcal{H}}$ [$ConsvInf_{\mathcal{H}}$]-identifies $C$ iff, for all $c' \in C$, $M$ $ConsvTxt_{\mathcal{H}}$
[$ConsvInf_{\mathcal{H}}$]-identifies $c'$.

The learning types $ConsvTxt$ and $ConsvInf$ are defined analogously as above.
The next theorem summarizes the known results concerning the relations 
between the standard learning models defined so far. 

Theorem 1 ([10], [16]).

(1) For all indexable classes $C$, we have $C \in LimInf$.

(2) $FinTxt \subset FinInf \subset ConsvTxt \subset LimTxt \subset ConsvInf = LimInf$.
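
Assertion (1) is classically obtained by identification by enumeration: on each initial segment of an informant, guess the least index whose concept agrees with all labelled examples seen so far. The sketch below illustrates this on a toy class; the `member` predicate, the bound `max_index`, and the label encoding are assumptions made so the example terminates and runs.

```python
def learn_by_enumeration(member, segment, max_index=1000):
    # Least j such that c_j classifies every (x, b) in the segment correctly.
    for j in range(max_index):
        if all(member(j, x) == (b == "+") for x, b in segment):
            return j
    return None  # plays the role of "?" when no index below the bound fits

def guesses_on_prefixes(member, informant_prefix):
    # Hypotheses produced on the initial segments of growing length.
    return [learn_by_enumeration(member, informant_prefix[: y + 1])
            for y in range(len(informant_prefix))]
```

With the hypothetical class $c_j = \{0, \ldots, j\}$, encoded as `member = lambda j, x: x <= j`, and the informant prefix `[(0, "+"), (3, "-"), (2, "+"), (1, "+"), (4, "-")]` for $c_2$, the guesses stabilise on index 2. Note that this learner inspects the whole segment at every step, which is exactly the access that the incremental models below restrict.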

Now, we formally define the different models of incremental learning. 

An ordinary IIM $M$ always has access to the whole history of the learning
process, i.e., it computes its actual guess on the basis of all data seen so far. In
contrast, an iterative IIM is only allowed to use its last guess and the next data
element in $\sigma$. Conceptually, an iterative IIM $M$ defines a sequence $(M_n)_{n \in \mathbb{N}}$ of
machines each of which takes as its input the output of its predecessor.

Definition 3 ([…]). Let $C$ be an indexable class, let $c$ be a concept, and let
$\mathcal{H} = (h_j)_{j \in \mathbb{N}}$ be a hypothesis space. An IIM $M$ $ItTxt_{\mathcal{H}}$ [$ItInf_{\mathcal{H}}$]-identifies $c$
iff, for every data sequence $\sigma = (d_n)_{n \in \mathbb{N}}$ with $\sigma \in Text(c)$ [$\sigma \in Info(c)$], the
following conditions are fulfilled:

(1) for all $n \in \mathbb{N}$, $M_n(\sigma)$ is defined, where $M_0(\sigma) = M(d_0)$ as well as
$M_{n+1}(\sigma) = M(M_n(\sigma), d_{n+1})$.

(2) the sequence $(M_n(\sigma))_{n \in \mathbb{N}}$ converges to a number $j$ with $h_j = c$.

Finally, $M$ $ItTxt_{\mathcal{H}}$ [$ItInf_{\mathcal{H}}$]-identifies $C$ iff, for each $c' \in C$, $M$ $ItTxt_{\mathcal{H}}$
[$ItInf_{\mathcal{H}}$]-identifies $c'$.

¹ In the text case, $t^+_{y+1} \not\subseteq h_k$, and, in the informant case, $i^+_{y+1} \not\subseteq h_k$ or $i^-_{y+1} \not\subseteq \text{co-}h_k$.
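
The recursion $M_0(\sigma) = M(d_0)$, $M_{n+1}(\sigma) = M(M_n(\sigma), d_{n+1})$ is a fold over the data sequence. The following sketch runs such a learner on a finite prefix; splitting $M$ into an `initial` and an `update` function is a presentational assumption, and the toy learner in the usage note is not taken from the paper.

```python
def run_iterative(initial, update, data):
    # Returns the guesses M_0, M_1, ... on a finite prefix of a data sequence.
    items = iter(data)
    guess = initial(next(items))
    guesses = [guess]
    for d in items:
        # The learner sees only its previous guess and the new element.
        guess = update(guess, d)
        guesses.append(guess)
    return guesses
```

For the hypothetical class $c_j = \{0, \ldots, j\}$ learned from text, `run_iterative(lambda d: d, max, [1, 0, 3, 3, 2])` yields the guesses `[1, 1, 3, 3, 3]`: the largest element seen so far is already encoded in the previous hypothesis, so no further memory is needed.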

The learning types $ItTxt$ and $ItInf$ are defined analogously to Definition 1.
Next, we consider a natural relaxation of iterative learning, named $k$-bounded
example-memory inference. Now, an IIM $M$ is allowed to memorize at most $k$ of
the data elements which it has already seen in the learning process, where $k \in \mathbb{N}$
is a priori fixed. Again, $M$ defines a sequence $(M_n)_{n \in \mathbb{N}}$ of machines each of
which takes as input the output of its predecessor. Clearly, a $k$-bounded example-
memory IIM outputs a hypothesis along with the set of memorized data elements.

Definition 4 ([18]). Let $C$ be an indexable class, let $c$ be a concept, and let
$\mathcal{H} = (h_j)_{j \in \mathbb{N}}$ be a hypothesis space. Moreover, let $k \in \mathbb{N}$. An IIM $M$ $Bem_kTxt_{\mathcal{H}}$
[$Bem_kInf_{\mathcal{H}}$]-identifies $c$ iff, for every data sequence $\sigma = (d_n)_{n \in \mathbb{N}}$ with $\sigma \in Text(c)$
[$\sigma \in Info(c)$], the following conditions are satisfied:

(1) for all $n \in \mathbb{N}$, $M_n(\sigma)$ is defined, where $M_0(\sigma) = M(d_0) = (j_0, S_0)$ such
that $S_0 \subseteq \{d_0\}$ and $card(S_0) \le k$ as well as $M_{n+1}(\sigma) = M(M_n(\sigma), d_{n+1}) = (j_{n+1}, S_{n+1})$
such that $S_{n+1} \subseteq S_n \cup \{d_{n+1}\}$ and $card(S_{n+1}) \le k$.

(2) the $j_n$ in the sequence $((j_n, S_n))_{n \in \mathbb{N}}$ of $M$'s guesses converge to a number $j$
with $h_j = c$.

Finally, $M$ $Bem_kTxt_{\mathcal{H}}$ [$Bem_kInf_{\mathcal{H}}$]-identifies $C$ iff, for each $c' \in C$, $M$
$Bem_kTxt_{\mathcal{H}}$ [$Bem_kInf_{\mathcal{H}}$]-identifies $c'$.
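
The two memory constraints in condition (1) can be enforced by a small driver. The sketch below runs a learner that returns a pair (guess, memory) and raises an error if the memory ever exceeds $k$ elements or contains an element that was neither stored before nor just presented. The driver, its name, and the toy learner in the test are illustrative assumptions.

```python
def run_bounded_memory(k, initial, update, data):
    # Returns the pairs (j_n, S_n) on a finite prefix, checking condition (1).
    items = iter(data)
    d = next(items)
    guess, memory = initial(d)
    memory = frozenset(memory)
    if not (memory <= {d} and len(memory) <= k):
        raise ValueError("initial memory violates the k-bounded constraint")
    trace = [(guess, memory)]
    for d in items:
        new_guess, new_memory = update(guess, memory, d)
        new_memory = frozenset(new_memory)
        # S_{n+1} must be drawn from S_n plus the current element only.
        if not (new_memory <= memory | {d} and len(new_memory) <= k):
            raise ValueError("memory violates the k-bounded constraint")
        guess, memory = new_guess, new_memory
        trace.append((guess, memory))
    return trace
```

With $k = 0$ the memory is always empty and the driver reduces to the iterative case, matching the convention that 0-bounded example-memory learning is iterative learning.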

For every $k \in \mathbb{N}$, the learning types $Bem_kTxt$ and $Bem_kInf$ are defined
analogously as above. By definition, $Bem_0Txt = ItTxt$ and $Bem_0Inf = ItInf$.

Next, we define learning by feedback IIMs. Informally speaking, a feedback
IIM $M$ is an iterative IIM that is additionally allowed to make a particular type
of queries. In each learning stage $n + 1$, $M$ has access to the actual input $d_{n+1}$ and
its previous guess $j_n$. $M$ is additionally allowed to compute a query from $d_{n+1}$
and $j_n$ which concerns the history of the learning process. That is, the feedback
learner computes a data element $d$ and gets a “YES/NO” answer $A(d)$ such that
$A(d) = 1$, if $d$ appears in the initial segment $\sigma_n$, and $A(d) = 0$, otherwise. Hence,
$M$ can just ask whether or not the particular data element $d$ has already been
presented in previous learning stages.

Definition 5 ([…]). Let $C$ be an indexable class, let $c$ be a concept, and let
$\mathcal{H} = (h_j)_{j \in \mathbb{N}}$ be a hypothesis space. Let $Q \colon \mathbb{N} \times X \to X$ [$Q \colon \mathbb{N} \times X_i \to X_i$,
where $X_i = X \times \{+, -\}$] be a total computable function. An IIM $M$, with a query
asking function $Q$, $FbTxt_{\mathcal{H}}$ [$FbInf_{\mathcal{H}}$]-identifies $c$ iff, for every data sequence
$\sigma = (d_n)_{n \in \mathbb{N}}$ with $\sigma \in Text(c)$ [$\sigma \in Info(c)$], the following conditions are satisfied:

(1) for all $n \in \mathbb{N}$, $M_n(\sigma)$ is defined, where $M_0(\sigma) = M(d_0)$ as well as
$M_{n+1}(\sigma) = M(M_n(\sigma), A(Q(M_n(\sigma), d_{n+1})), d_{n+1})$.

(2) the sequence $(M_n(\sigma))_{n \in \mathbb{N}}$ converges to a number $j$ with $h_j = c$ provided $A$
truthfully answers the questions computed by $Q$.

Finally, $M$ $FbTxt_{\mathcal{H}}$ [$FbInf_{\mathcal{H}}$]-identifies $C$ iff, for each $c' \in C$, $M$ $FbTxt_{\mathcal{H}}$
[$FbInf_{\mathcal{H}}$]-identifies $c'$.
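
The interplay of $M$, $Q$ and $A$ in condition (1) can be sketched as a driver that answers each query truthfully against the initial segment $\sigma_n$. The function names and the toy learner in the usage note are assumptions for illustration; the toy learner only shows the mechanics of a feedback query and is not a learner from the paper.

```python
def run_feedback(initial, query, update, data):
    # Returns the guesses M_0, M_1, ... of a feedback learner on a finite prefix.
    data = list(data)
    guess = initial(data[0])
    guesses = [guess]
    for n in range(len(data) - 1):
        d_next = data[n + 1]
        asked = query(guess, d_next)                 # Q(M_n(sigma), d_{n+1})
        answer = 1 if asked in data[: n + 1] else 0  # A(d) relative to sigma_n
        guess = update(guess, answer, d_next)
        guesses.append(guess)
    return guesses
```

As a usage example, a learner that asks whether the incoming element has been seen before and counts the distinct elements, `run_feedback(lambda d: 1, lambda g, d: d, lambda g, a, d: g if a else g + 1, [5, 7, 5, 9, 7])`, produces `[1, 2, 2, 3, 3]`.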

The learning types $FbTxt$ and $FbInf$ are defined analogously as above.

3.2 Learning from Noisy Data

In order to study iterative learning from noisy data we have to provide some 
more notations and definitions. 

Let $X$ be the underlying learning domain, let $c \subseteq X$ be a concept, and let
$t = (x_n)_{n \in \mathbb{N}}$ be an infinite sequence of elements from $X$. Following [22], $t$ is said
to be a noisy text for $c$ provided that every element from $c$ appears infinitely
often, i.e., for every $x \in c$ there are infinitely many $n$ such that $x_n = x$, whereas
only finitely often some $x \notin c$ occurs, i.e., $x_n \in c$ for all but finitely many $n \in \mathbb{N}$.
By $NText(c)$ we denote the collection of all noisy texts for $c$. For every $y \in \mathbb{N}$,
$t_y$ denotes the initial segment of $t$ of length $y + 1$. We let $t_y^+ = \{x_n \mid n \le y\}$.

Next, let $i = ((x_n, b_n))_{n \in \mathbb{N}}$ be any sequence of elements from $X \times \{+, -\}$.
Following [22], $i$ is said to be a noisy informant for $c$ provided that every
element $x$ of $X$ occurs infinitely often, almost always accompanied by the right
classification $c(x)$. More formally, for all $x \in X$, there are infinitely many $n \in \mathbb{N}$
such that $x_n = x$ and, for all but finitely many of them, $b_n = c(x)$. By $NInfo(c)$
we denote the collection of all noisy informants for $c$. For every $y \in \mathbb{N}$, $i_y$
denotes the initial segment of $i$ of length $y + 1$, $i_y^+ = \{x_n \mid n \le y, b_n = +\}$, and
$i_y^- = \{x_n \mid n \le y, b_n = -\}$.

In contrast to the noise-free case, now an IIM receives as input finite se- 
quences of a noisy text (noisy informant). When an IIM is supposed to identify 
some target concept class C, then it has to output a hypothesis on every ad- 
missible information sequence, i.e., any initial segment of any noisy text (noisy 
informant) for any $c \in C$. Analogously to the case of learning from noise-free
data, we deal with class comprising learning (cf. [27]).

The learning types $LimNTxt$, $FinNTxt$, $ConsvNTxt$, $ItNTxt$, $Bem_kNTxt$,
and $FbNTxt$ as well as $LimNInf$, $FinNInf$, $ConsvNInf$, $ItNInf$, $Bem_kNInf$,
and $FbNInf$ are defined analogously to their noise-free counterparts by replacing
everywhere text and informant by noisy text and noisy informant, respectively.

The following theorem summarizes the known results concerning learning of 
indexable concept classes from noisy data. 

Theorem 2

(1) $FinTxt \subset LimNTxt \subset LimTxt$.

(2) $FinInf \subset LimNInf \subset LimInf$.

(3) $LimNTxt \# LimNInf$.

4 Incremental Learning from Noise-Free Data

4.1 The Text Case 

In this subsection, we briefly review the known relations between the different 
variants of incremental learning and the standard learning models defined above. 

All the models of incremental learning introduced above pose serious restric- 
tions on the accessibility of the data provided during the learning process. There- 
fore, one might expect a certain loss of learning power. And indeed, conservative 
inference already forms an upper bound for any kind of incremental learning.

Theorem 3

(1) $ItTxt \subset ConsvTxt$.

(2) $FbTxt \subset ConsvTxt$.

(3) $Bem_kTxt \subset ConsvTxt$.

Moreover, bounded example-memory inference and feedback learning enlarge
the learning capabilities of iterative identification, but the surplus power gained
is incomparable. In addition, the existence of an infinite hierarchy of more and
more powerful bounded example-memory learners has been shown.

Theorem 4 ([…])

(1) $ItTxt \subset FbTxt$.

(2) $ItTxt \subset Bem_1Txt$.

(3) For all $k \in \mathbb{N}$, $Bem_kTxt \subset Bem_{k+1}Txt$.

(4) $Bem_1Txt \setminus FbTxt \ne \emptyset$.

(5) $FbTxt \setminus Bem_kTxt \ne \emptyset$.

A comparison of feedback learning and bounded example-memory inference 
with finite inference from positive and negative data illustrates another difference 
between both generalizations of iterative learning. 

Theorem 5 ([18])

(1) $\bigcup_{k \in \mathbb{N}} Bem_kTxt \# FinInf$.

(2) $FinInf \subset FbTxt$.

Finally, finite inference from text is strictly less powerful than any kind of 
incremental learning. 

Theorem 6 ([18]) $FinTxt \subset ItTxt$.



4.2 The Informant Case 

Next, we study the strengths and the limitations of incremental learning from 
informant. Our first result deals with the similarities to the text case. 

Theorem 7 $FinInf \subset ItInf \subset LimInf$.

Moreover, analogously to the case of learning from only positive data, feed- 
back learners and bounded example-memory learners are more powerful than 
iterative IIMs. But surprisingly, the surplus learning power gained is remark- 
able. The ability to make queries concerning the history of the learning process 
fully compensates the limitations in the accessibility of the input data. 

Theorem 8 $FbInf = LimInf$.

Even more surprisingly, the infinite hierarchy of more and more powerful
$k$-bounded example-memory learners, parameterized by the number of data
elements the relevant iterative learners may store, collapses in the informant case.
The ability to memorize one carefully selected data element is also sufficient to
fully compensate for the limitations in the accessibility of the input data.

Theorem 9 Bemilnf = Liminf. 
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Proof. It suffices to show that Liminf C Bemilnf. Let C be any indexable 
class and (cj)jgiN be any indexing of C. By Theorem^ C G Liminf. 

Let denote the lexicographically ordered enumeration of all elements 

in X. For all to G IN and all c C X, we set c™ = {wz | z < to, G c} and 
c”^ = {wz \ z > m, Wz G c}. 

We let the required 1-bounded example-memory learner M output as hy- 
pothesis a triple {F,m,j) along with a singleton set containing the one data 
element stored. The triple {F,m,j) consists of a finite set F and two numbers m 
and j. It is used to describe a finite variant of the concept cj, namely the concept 
FUc™. Intuitively speaking, Ff is the part of the concept cj that definitely does 
not contradict the data seen so far, while F is used to handle exceptions. For 
the sake of readability, we abstain from explicitly defining a hypothesis space Ti. 
that provides an appropriate coding of all finite variants h(^p „i,j) = F U c™ of 
concepts in C. 

Let $c \in \mathcal{C}$ and let $i = ((x_n, b_n))_{n \in \mathbb{N}} \in \mathit{Info}(c)$. $M$ is defined in stages.

Stage 0. On input $(x_0, b_0)$ do the following:

Fix $m \in \mathbb{N}$ with $w_m = x_0$. Determine the least $j$ such that $c_j$ is consistent
with $(x_0, b_0)$. Set $S = \{(x_0, b_0)\}$. Output $((c_j^m, m, j), S)$ and goto Stage 1.

Stage $n$, $n \geq 1$. On input $((F, m, j), S)$ and $(x_n, b_n)$ proceed as follows:

Let $S = \{(x, b)\}$. Fix $z, z' \in \mathbb{N}$ such that $w_z = x$ and $w_{z'} = x_n$. If $z' > z$,
set $S' = \{(x_n, b_n)\}$. Otherwise, set $S' = S$. Test whether $h_{(F,m,j)} = F \cup c_j^{\overline{m}}$
is consistent with $(x_n, b_n)$. In case it is, goto (A). Otherwise, goto (B).

(A) Output $((F, m, j), S')$ and goto Stage $n+1$.

(B) If $z' \leq m$, goto ($\beta$1). If $z' > m$, goto ($\beta$2).

($\beta$1) If $b_n = +$, set $F' = F \cup \{x_n\}$. If $b_n = -$, set $F' = F \setminus \{x_n\}$. Output
$((F', m, j), S')$ and goto Stage $n+1$.

($\beta$2) Determine $\ell = \max\{z, z'\}$ and $F' = \{w_r \mid r \leq \ell,\ w_r \in h_{(F,m,j)}\}$. If
$b_n = +$, set $F'' = F' \cup \{x_n\}$. If $b_n = -$, set $F'' = F' \setminus \{x_n\}$. Search
for the least index $k > j$ such that $c_k$ is consistent with $(x_n, b_n)$.
Then, output $((F'', \ell, k), S')$ and goto Stage $n+1$.

Due to space limitations, the verification of M’s correctness is skipped. □ 
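
The stages of $M$ can be traced on a small example. The following Python sketch is only an illustration under stated assumptions, not the authors' implementation: the learning domain is taken to be the natural numbers with $w_z = z$, the indexing is a finite list of membership predicates (a stand-in for an infinite indexing), labels are booleans, and the initial finite set in Stage 0 is assumed to be $c_j^m$. All function names are hypothetical.

```python
from typing import Callable, FrozenSet, List, Tuple

Concept = Callable[[int], bool]                # membership test of one concept c_j
Hypothesis = Tuple[FrozenSet[int], int, int]   # the triple (F, m, j)
Example = Tuple[int, bool]                     # (x, b): element and its label


def in_hypothesis(hyp: Hypothesis, concepts: List[Concept], x: int) -> bool:
    """Membership in F united with the part of c_j above the cut-off m."""
    F, m, j = hyp
    return x in F or (x > m and concepts[j](x))


def stage0(concepts: List[Concept], example: Example):
    x0, b0 = example
    m = x0                                     # w_m = x_0 because w_z = z here
    j = next(i for i, c in enumerate(concepts) if c(x0) == b0)
    F = frozenset(z for z in range(m + 1) if concepts[j](z))  # assumed: F = c_j^m
    return (F, m, j), example


def stage(concepts: List[Concept], hyp: Hypothesis, stored: Example, example: Example):
    F, m, j = hyp
    z = stored[0]
    x, b = example
    new_stored = example if x > z else stored  # keep the element with the larger index
    if in_hypothesis(hyp, concepts, x) == b:   # case (A): hypothesis is consistent
        return hyp, new_stored
    if x <= m:                                 # case (beta 1): patch the exception set
        return ((F | {x}) if b else (F - {x}), m, j), new_stored
    ell = max(z, x)                            # case (beta 2): move the cut-off
    F1 = frozenset(r for r in range(ell + 1) if in_hypothesis(hyp, concepts, r))
    F2 = (F1 | {x}) if b else (F1 - {x})
    # Raises StopIteration if the finite list holds no consistent later concept.
    k = next(i for i in range(j + 1, len(concepts)) if concepts[i](x) == b)
    return (F2, ell, k), new_stored


def run(concepts: List[Concept], informant: List[Example]) -> Hypothesis:
    hyp, stored = stage0(concepts, informant[0])
    for ex in informant[1:]:
        hyp, stored = stage(concepts, hyp, stored, ex)
    return hyp


concepts = [lambda x: False, lambda x: x % 2 == 0, lambda x: x % 3 == 0]
informant = [(x, x % 3 == 0) for x in range(30)]
final = run(concepts, informant)
```

In this run the learner first guesses the even numbers, is contradicted by the negative example 2, and then settles on a finite variant of the multiples of 3 while storing only one example at a time.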

On the other hand, it is well-known that, when learning from informant,
iterative learning with finitely many anomalies* is exactly as powerful as learning
in the limit. Hence, Theorems 8 and 9 demonstrate the error correcting power
of feedback queries and bounded example-memories.

Next, we summarize the established relations between the different models of
incremental learning from text and their corresponding informant counterparts.

Corollary 10

(1) $\mathit{ItTxt} \subset \mathit{ItInf}$.

(2) $\mathit{Bem}_k\mathit{Txt} \subset \mathit{Bem}_1\mathit{Inf}$.

(3) $\mathit{FbTxt} \subset \mathit{FbInf}$.

* In this setting, it suffices that an iterative learner converges to a hypothesis which
describes a finite variant of the target concept.



128 Steffen Lange and Gunter Grieser 



Finally, we want to point out further differences between incremental learning 
from text and informant. 

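Theorem 11

(1) $\mathit{ItInf} \setminus \mathit{LimTxt} \neq \emptyset$.

(2) $\mathit{Bem}_1\mathit{Txt} \setminus \mathit{ItInf} \neq \emptyset$.

(3) $\mathit{FbTxt} \setminus \mathit{ItInf} \neq \emptyset$.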

The figure aside summarizes the 
observed separations and coincidences. 

Each learning type is represented as a 
vertex in a directed graph. A directed 
edge (or path) from vertex A to vertex 
B indicates that A is a proper subset 
of B. Moreover, the absence of an edge (or path) between
these vertices implies that A and
B are incomparable.

5 Incremental Learning from Noisy Data 

5.1 Characterizations 

In this section, we present characterizations of all models of learning from noisy 
text and noisy informant. First, we characterize iterative learning from noisy 
text in purely structural terms. 

Theorem 12 $\mathcal{C} \in \mathit{ItNTxt}$ iff $\mathcal{C}$ is inclusion-free.

Since $\mathit{LimNTxt}$ exclusively contains indexable classes that are inclusion-free
(cf. [?]), by Theorem 12 and since, by definition, $\mathit{ItNTxt} \subseteq \mathit{Bem}_k\mathit{NTxt}$ and
$\mathit{ItNTxt} \subseteq \mathit{FbNTxt}$, we arrive at the following insight.

Theorem 13

(1) $\mathit{ItNTxt} = \mathit{ConsvNTxt} = \mathit{LimNTxt}$.

(2) $\mathit{ItNTxt} = \mathit{FbNTxt} = \bigcup_{k \in \mathbb{N}} \mathit{Bem}_k\mathit{NTxt}$.

Interestingly, another structural property allows us to characterize the collec- 
tion of all indexable classes that can be iteratively learned from noisy informant. 

Theorem 14 $\mathcal{C} \in \mathit{ItNInf}$ iff $\mathcal{C}$ is discrete.

Proof. Necessity: Recently, it has been shown that every class of recursively
enumerable languages that is learnable in the limit from noisy informant has
to be discrete (cf. Stephan [22]; see also Case et al. [?] for the relevant
details). Clearly, this result immediately translates to our setting. Now, since, by
definition, $\mathit{ItNInf} \subseteq \mathit{LimNInf}$, we are done.

Sufficiency: Let $\mathcal{C}$ be an indexable class that is discrete. Informally speaking,
the required iterative learner $M$ behaves as follows. In every learning stage,
$M$ outputs an index for some concept, say $c'$, along with a number $k$. The
number $k$ is a lower bound for the length of the shortest initial segment of the
lexicographically ordered informant of $c'$ that separates $c'$ from all other concepts
in the target class $\mathcal{C}$. Since $M$ does not know whether a new data element is really
correct, $M$ rejects its actual guess only in case the new data element contradicts
the information represented in this initial segment. Moreover, since $M$ is
supposed to learn in an iterative manner, $M$ has to use the input data to improve
its actual lower bound $k$.

We proceed formally. Let $(c_j)_{j \in \mathbb{N}}$ be any indexing of $\mathcal{C}$. For all $j \in \mathbb{N}$,
[...] denotes the lexicographically ordered informant of $c_j$. As before, let $(w_j)_{j \in \mathbb{N}}$ be
the lexicographically ordered enumeration of all elements in $X$. Moreover, let
$f$ be any total recursive function such that, for all $z \in \mathbb{N}$, there are infinitely
many $j \in \mathbb{N}$ with $f(j) = z$. We select a hypothesis space $\mathcal{H} = (h_{(j,k)})_{j,k \in \mathbb{N}}$ that
meets, for all $j, k \in \mathbb{N}$, $h_{(j,k)} = c_{f(j)}$. The required iterative IIM $M$ is defined in
stages. Let $c \in \mathcal{C}$ and let $i = ((x_n, b_n))_{n \in \mathbb{N}}$ be any noisy informant for $c$.

Stage 0. On input $(x_0, b_0)$ do the following:

Set $j_0 = 0$, set $k_0 = 0$, output $(j_0, k_0)$, and goto Stage 1.

Stage $n$, $n \geq 1$. On input $(j_{n-1}, k_{n-1})$ and $(x_n, b_n)$ do the following:

Determine $p \in \mathbb{N}$ with $w_p = x_n$. If $p < k_{n-1}$, execute Instruction (A).
Otherwise, execute Instruction (B).

(A) Test whether or not $b_n = c_{f(j_{n-1})}(x_n)$. In case it is, set $j_n = j_{n-1}$ and
$k_n = k_{n-1}$. Otherwise, set $j_n = j_{n-1} + 1$ and $k_n = 0$. Output $(j_n, k_n)$
and goto Stage $n+1$.

(B) For all $z < p$, test whether [...] and [...]. In case
there is a $z$ successfully passing this test, set $j_n = j_{n-1}$ and $k_n = k_{n-1} + 1$.
Otherwise, set $j_n = j_{n-1}$ and $k_n = k_{n-1}$. Output $(j_n, k_n)$ and goto
Stage $n+1$.

Due to space limitations, the verification of M’s correctness is skipped. □ 
Analogously to the text case, all models of learning from noisy informant 
coincide, except FinNInf . 

Theorem 15

(1) $\mathit{ItNInf} = \mathit{ConsvNInf} = \mathit{LimNInf}$.

(2) $\mathit{ItNInf} = \mathit{FbNInf} = \bigcup_{k \in \mathbb{N}} \mathit{Bem}_k\mathit{NInf}$.

Furthermore, the collection of all indexable classes that can be finitely learned 
from noisy text (noisy informant) is easily characterized as follows. 
Proposition 1 Let C be an indexable class. Then, the following statements are 
equivalent: 

(1) $\mathcal{C} \in \mathit{FinNTxt}$.

(2) $\mathcal{C} \in \mathit{FinNInf}$.

(3) C contains at most one concept. 

5.2 Comparisons with Other Learning Types 

The characterizations presented in the last subsection form a firm basis for further
investigations. They are useful to prove further results illustrating the relation
of learning from noisy text and noisy informant to all the other types of
learning indexable classes defined. Subsequently, $\mathit{ItNTxt}$ and $\mathit{ItNInf}$ are used as
representatives for all models of learning from noisy data, except finite inference.

The next two theorems sharpen the upper bounds for learning from noisy
data established in [?] (cf. Theorem 13 above).

Theorem 16 $\mathit{ItNTxt} \subset \mathit{ConsvTxt}$.

The next theorem puts the weakness of learning from noisy informant in the 
right perspective. 

Theorem 17 $\mathit{ItNInf} \subset \mathit{LimTxt}$.

The reader should note that Theorem 17 cannot be sharpened to $\mathit{ItNInf} \subset \mathit{ConsvTxt}$,
since there are discrete indexable classes not belonging to $\mathit{ConsvTxt}$
(cf. [?]). Since the class of all finite concepts is $\mathit{ConsvTxt}$-identifiable and
obviously not discrete, we may conclude:

Theorem 18 $\mathit{ItNInf} \mathrel{\#} \mathit{ConsvTxt}$.

The next theorem provides us with the missing piece in the overall picture.

Theorem 19 $\mathit{ItNTxt} \mathrel{\#} \mathit{FinInf}$.

The figure aside displays the established relations between the different models
of learning from noisy data and the standard models of learning in the noise-free
setting. The semantics of this figure is the same as that of the figure in the
previous section. The displayed relations between the learning models $\mathit{FinNTxt}$,
$\mathit{FinNInf}$, and $\mathit{FinTxt}$ are rather trivial. On the one hand, every singleton
concept class is obviously $\mathit{FinTxt}$-identifiable. On the other hand, $\mathit{FinTxt}$ also
contains richer indexable classes.

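Recall that Assertion (3) of Theorem [...] rewrites into $\mathit{ItNTxt} \mathrel{\#} \mathit{ItNInf}$. As we
shall see, this result generalizes as follows: All models of iterative learning are
pairwise incomparable, except $\mathit{ItTxt}$ and $\mathit{ItInf}$.

Theorem 20

(1) $\mathit{ItNTxt} \mathrel{\#} \mathit{ItTxt}$. (2) $\mathit{ItNTxt} \mathrel{\#} \mathit{ItInf}$.

(3) $\mathit{ItNInf} \mathrel{\#} \mathit{ItTxt}$. (4) $\mathit{ItNInf} \mathrel{\#} \mathit{ItInf}$.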
Abstract. Learning in the limit deals mainly with the question of what
can be learned, but not very often with the question of how fast. The 
purpose of this paper is to develop a learning model that stays very close 
to Gold’s model, but enables questions on the speed of convergence to 
be answered. In order to do this, we have to assume that positive ex- 
amples are generated by some stochastic model. If the stochastic model 
is fixed (measure one learning), then all recursively enumerable sets are 
identifiable, while straying greatly from Gold’s model. In contrast, we 
define learning from random text as identifying a class of languages for 
every stochastic model where examples are generated independently and 
identically distributed. As it turns out, this model stays close to learn- 
ing in the limit. We compare both models keeping several aspects in 
mind, particularly when restricted to several strategies and to the exis- 
tence of locking sequences. Lastly, we present some results on the speed 
of convergence: In general, convergence can be arbitrarily slow, but for 
recursive learners, it cannot be slower than some magic function. Every 
language can be learned with exponentially small tail bounds, which are 
also the best possible. All results apply fully to Gold-style learners, since 
his model is a proper subset of learning from random text. 



1 Introduction 

Learning in the limit as defined by Gold has attracted much attention. It can 
be described as follows: The words of a language are presented to a learner in 
some order, where duplicates are allowed. At each point of time the learner has 
seen only a finite subset of the language and thus gets an increasingly improved 
idea of the language. The learner issues hypotheses that at some point have to 
converge to a correct one (and never to be changed afterwards). 

The representation of a language as an infinite sequence, containing exactly 
the words of the language, is called a text. We will use the terms learning in the 
limit, identification in the limit, and learning from text synonymously. 

Much work on learning from text has addressed the question of which
classes of languages are learnable and of which restrictions, and combinations of
restrictions, on learners reduce the power of identification. A set of learners
is called a strategy. The most important strategy is that of recursive learners. In this
paper we will deal with recursive as well as non-recursive learners. Another type
of strategy is consistency [?]. A learner is consistent if every hypothesis denotes
a language that contains at least all words seen thus far. It is well known that
consistency does not generally restrict learning power, but it does restrict the
power of recursive learners.

What has not been greatly addressed is how fast learning takes place. The
reason for this is simple: With the exception of trivial cases there exists no
upper bound on the learning time because texts can always be padded with useless
information at the beginning. This leaves average case analysis
as a possible way to proceed. A prerequisite for this is a stochastic model for
the texts. In general these models form a text by generating every word in the
text independently of the others and according to the same probability distribution,
which is well motivated by several scientific contexts. Results on learning pattern
languages and monomials [?] emerged as a consequence. These
results are bounds on the average number of examples or on the time required
to learn on average with respect to a typically large class of probability distributions.
Some of the papers also deal with aspects other than the average time or
number of examples: They present tail bounds on the probability of convergence.
More specifically, they show that some algorithms have exponentially small tail
bounds [?]. There is also one general result: Every learner that is conservative
and set-driven automatically has small tail bounds with respect to every probability
distribution [?]. Knowledge about tail bounds enables a learner to stop
after a finite time and to announce his hypothesis as correct with high probability.
There is a direct connection to stochastically finite learning, which was
introduced in [?] (see also additional remarks in [?]).

In general, fixing probability distributions for each language in some class
leads to a learning model that is much more powerful than that of identification
in the limit: The whole class of recursively enumerable languages becomes
learnable (see [?]). This model is called measure one learning. Kapur and Bilardi
present an elegant way to construct simple learners for a collection of languages
if there is some minimal knowledge about the underlying probability distributions [?].
They also show how knowledge about distributions provides indirect
negative evidence.

In this paper we introduce the model of learning from random text, which
overcomes the difficulties above: It does not suffice to learn for fixed probability
distributions; every reasonable distribution must be considered. In other words,
a learner identifies a language from random text if he converges to a correct
hypothesis with probability one for every probability distribution for which the
probability of a word is positive iff the word is a member of the language. Formal
definitions are contained in the next section. (See also Kapur and Bilardi [?] for
learning of indexed languages.)

In the first part of this paper we investigate some basic properties of this
model. In particular, if a learner learns from text he also learns from random
text, but not necessarily vice versa. In this sense we generalize the classical
model, since a learner is now allowed to fail on certain texts.
These texts, which we call the failure set of the learner, must have measure zero
for every admissible distribution. However, if a class is learnable from random
text, it can also be learned from text. This is also valid if we restrict ourselves
to recursive learners. This equivalence carries over to all strategies that do not
restrict identification in the limit, of which there are many.

To further compare the two models, we investigate some strategies that act
as a restriction for recursive learning in the limit, namely set-driven learners (whose
hypotheses depend only on the set of examples seen, not on their multiplicity
or order), confident learners (who converge on all texts for all languages), and
memory limited learners (who can remember only a constant number of examples
back in time). It turns out that being set-driven or confident is a restriction
for recursive learning from random text, too. Limited memory, on the other hand,
does not act as a restriction.

Every learner that learns in the limit has a locking sequence [?], whose existence
plays a crucial part in many proofs. After reading a locking sequence as a
prefix of a text, a learner never again changes his hypothesis. Blum and Blum
showed that every learner has a locking sequence and that every initial prefix
of a text is a prefix of a locking sequence. When learning from random text,
locking sequences do not necessarily exist. We offer a simple characterization in
terms of topological properties of the failure set, showing whether or not locking
sequences nevertheless exist. The topology concerned is the natural topology on
the sequences of words from the languages to be learned: A locking sequence
exists iff (1) the failure set is not comeager and iff (2) it is not dense. Similarly,
every prefix of a text is a prefix of a locking sequence iff (1) the failure set is
meager and iff (2) it is nowhere dense.

In the second part of this paper we investigate the functions that map $n$ to
the probability that the learner has not converged to the correct hypothesis after
reading $n$ examples. Such a function will be called a convergence indicator. The limit
of a convergence indicator is zero for every admissible probability distribution,
as is to be expected. We show that, except for trivial cases, a convergence indicator
cannot be smaller than exponential, i.e., it is always $\Omega(1)^n$. This is true
regardless of the class of languages to be learned and the underlying probability
distribution; hence we have a lower bound. Does an upper bound exist as well? In
other words, is it possible to learn arbitrarily slowly? The answer is generally
‘yes’, but for recursive learners ‘no’. There exists some function $f$ such that every
recursive learner’s convergence indicator is $o(f(n))$. This forms a kind of magical
barrier: Either you can learn faster or not at all.

After finding lower and upper bounds for the convergence behavior of individual
learners, we have to consider the more general task of finding the best
convergence behavior among all learners that learn a class of languages: We can
always construct a learner that learns with exponentially small tail bounds, i.e.,
its convergence indicator is $O(1)^n$.
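
To make the notion of a convergence indicator concrete, the following Python sketch computes it exactly for one toy case. Everything specific here is an assumption chosen for illustration and does not come from the paper: the language is $\{1, \ldots, k\}$, words are drawn uniformly, and the (hypothetical) learner conjectures exactly the set of words seen so far, so it has converged as soon as every word has appeared once.

```python
from fractions import Fraction
from math import comb


def convergence_indicator(k: int, n: int) -> Fraction:
    """Probability that some word of {1..k} is still unseen after n uniform draws.

    Computed by inclusion-exclusion over the set of missing words.
    """
    return sum(
        (-1) ** (r + 1) * comb(k, r) * Fraction(k - r, k) ** n
        for r in range(1, k + 1)
    )


# Exponential lower and upper envelopes for k = 3:
# one fixed word missing  <=  some word missing  <=  union bound over 3 words.
lower = [Fraction(2, 3) ** n for n in range(1, 21)]
exact = [convergence_indicator(3, n) for n in range(1, 21)]
upper = [3 * Fraction(2, 3) ** n for n in range(1, 21)]
```

In this toy case the indicator is squeezed between two geometric sequences with the same ratio, which is the shape of behavior the exponential lower bound and the exponentially small tail bounds describe.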

We also show that if a learner is set-driven, then his tail bounds are automatically
exponentially small. This phenomenon has already been recognised for
learners that are simultaneously set-driven and conservative [?].

2 Preliminaries

This section contains fundamental definitions. The notation is almost identical
to that in Osherson, Stob, and Weinstein’s textbook [?]. The natural numbers
are denoted by $\mathbb{N} = \{1, 2, 3, \ldots\}$. A language is a subset of the natural numbers. Let
$W_i$ be the language accepted by the $i$th Turing machine and $RE = \{W_1, W_2, \ldots\}$
the recursively enumerable languages. A text is a sequence of natural numbers. If $t$
is a text, then $t_i$ denotes its $i$th component, i.e., $t = (t_1, t_2, \ldots)$. The range $\mathrm{rng}(t)$
of a text $t$ is the set of all its components, i.e., $\mathrm{rng}(t) = \{t_1, t_2, \ldots\}$. We say $t$ is a
text for $L$ if $\mathrm{rng}(t) = L$. The set of all texts for $L$ is denoted by $\mathcal{T}_L$ and $\mathcal{T}$ is the set
of all texts. The prefix of length $n$ of a text $t$ is denoted by $\bar{t}_n = (t_1, t_2, \ldots, t_n)$.
Let $\mathcal{F}$ be the set of all partial functions from $\mathbb{N}$ to $\mathbb{N}$. We will identify finite
sequences with natural numbers. If $\varphi \in \mathcal{F}$ and $\lim_{n \to \infty} \varphi(\bar{t}_n) = i$, i.e., $\varphi(\bar{t}_n) = i$
for all but finitely many $n$, we say that $\varphi$ converges on $t$ to $i$ and write $\varphi(t) = i$.
We say $\varphi$ converges on $t$ if it converges on $t$ to some $i$. If $W_{\varphi(t)} = L$ for every
text $t$ for $L$, then we say that $\varphi$ identifies $L$ (in the limit or from text). If $\mathcal{L} \subseteq RE$
and $\varphi$ identifies every $L \in \mathcal{L}$ then we say that $\varphi$ identifies $\mathcal{L}$.

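Let $\mathcal{S} \subseteq \mathcal{F}$ be a strategy. Then $[\mathcal{S}]$ denotes the set of all classes $\mathcal{L} \subseteq RE$ that
are learnable by some $\varphi \in \mathcal{S}$. In particular, $[\mathcal{F}]$ are all learnable classes and
$[\mathcal{F}^{rec}]$ all recursively learnable classes.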

We define a topology $\mathfrak{T}_L = (\mathcal{T}_{\wp(L)}, \mathcal{O}_L)$, where $\mathcal{T}_{\wp(L)}$ is the set of all texts
for all subsets of $L$. The open sets $\mathcal{O}_L$ are defined by a base consisting of all sets
$B^L_\sigma = \{\, t \in \mathcal{T}_{\wp(L)} \mid \bar{t}_n = \sigma \text{ for } n = \mathrm{lh}(\sigma) \,\}$, i.e., all texts for all subsets of $L$ that
have a prefix $\sigma$. ($\mathrm{lh}(\sigma)$ is the length of $\sigma$.) The induced topology is the sequence
space over $L$. More details can be found in [?].

With the help of $\mathfrak{T}_L$ we can define a probability space $(\mathcal{T}_{\wp(L)}, \mathfrak{A}_L, M_L)$, where
$\mathfrak{A}_L$, the measurable sets, is the smallest $\sigma$-algebra that contains all basic open
sets and $M_L$ is an admissible probability measure defined via $M_L(B^L_\sigma) = \prod_{i=1}^{\mathrm{lh}(\sigma)} m_L(\sigma_i)$,
where $m_L$ is a probability measure on $L$ and must fulfill $m_L(n) > 0$ iff $n \in L$
and $m_L(L) = 1$. The random variable $T$ denotes a random text for $L$;
technically, $T$ is the identity function on $\mathcal{T}$. If $P$ is a predicate for texts, we use
the usual abbreviation $M_L[P] = M_L(\{\, t \mid P(t) \,\})$. For example, $M_L[T \in B^L_\sigma] =
M_L(\{\, t \in \mathcal{T} \mid T(t) \in B^L_\sigma \,\}) = M_L(\{\, t \in \mathcal{T} \mid t \in B^L_\sigma \,\}) = M_L(B^L_\sigma)$. We are now
ready to define learning from random text formally.

Definition 1 A learner $\varphi \in \mathcal{F}$ identifies a language $L \in RE$ from random text
if $M_L[W_{\varphi(T)} = L] = 1$ for all admissible probability measures $M_L$. He identifies a
class $\mathcal{L}$ from random text if he identifies all $L \in \mathcal{L}$ from random text. The set of
all classes identified by learners in $\mathcal{S}$ from random text is denoted by $[[\mathcal{S}]]$.
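
As a small numeric illustration of the admissible measures used in Definition 1, the sketch below computes the measure of a basic open set. It assumes that words are drawn independently from $m_L$, so that the measure of the set of texts with prefix $\sigma$ is the product of the word probabilities along $\sigma$; the function name and the example distribution are hypothetical.

```python
from fractions import Fraction
from math import prod
from typing import Dict, Sequence


def basic_open_measure(m_L: Dict[int, Fraction], sigma: Sequence[int]) -> Fraction:
    """Measure of all texts that start with the finite sequence sigma.

    Words outside the support of m_L have probability 0, so any prefix that
    contains such a word yields a null set.
    """
    return prod((m_L.get(w, Fraction(0)) for w in sigma), start=Fraction(1))


# An admissible word distribution for L = {1, 2, 3}: positive exactly on L, total mass 1.
m_L = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}
```

For instance, the prefix (1, 2, 1) receives measure 1/2 · 1/3 · 1/2, and the empty prefix receives measure 1 because it constrains nothing.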

3 Relations to Identification in the Limit 

In this section we compare identification in the limit and identification from
random text. The first result is that learning from random text is at least as
powerful as learning from text for all strategies. Then we show that $[\mathcal{F}] = [[\mathcal{F}]]$
and $[\mathcal{F}^{rec}] = [[\mathcal{F}^{rec}]]$, i.e., that general learners and recursive learners,
respectively, are as a whole equally powerful in both models. The next theorem is
based on the simple fact that a random text is a text with probability one.

Theorem 1 $[\mathcal{S}] \subseteq [[\mathcal{S}]]$ for every strategy $\mathcal{S} \subseteq \mathcal{F}$.

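Proof. If $\varphi \in \mathcal{S}$ identifies a language $L$ from text then it identifies every text
for $L$. We have to show that $\varphi$ identifies $L$ on a random text with probability
one.

Let us fix some measure $M_L$. We want to show that $M_L(\mathcal{T}_L) = 1$. We can
write $\mathcal{T}_L = \mathcal{T}_{\wp(L)} - \bigcup_{i \in L} \mathcal{T}_{\wp(L - \{i\})}$, i.e., the texts for $L$ are all texts whose range
is a subset of $L$ minus all texts whose range is a proper subset of $L$. But $\mathcal{T}_{\wp(L)}$
is the sure event and therefore $M_L(\mathcal{T}_{\wp(L)}) = 1$. All that remains is to show that
$M_L(\mathcal{T}_{\wp(L - \{i\})}) = 0$ for every $i \in L$. We can write

$\mathcal{T}_{\wp(L - \{i\})} \subseteq X_n = \bigcup \{\, B^L_\sigma \mid \mathrm{lh}(\sigma) = n \text{ and } \sigma \text{ contains no } i \,\}$

and $M_L(X_n) = (1 - m_L(i))^n$ and consequently $\sum_{n=1}^{\infty} M_L(X_n) < \infty$. The
Borel-Cantelli Lemma implies $M_L(\limsup X_n) = 0$ and $M_L(\mathcal{T}_{\wp(L - \{i\})}) = 0$, since
$\mathcal{T}_{\wp(L - \{i\})} = \limsup X_n$. ($\limsup X_n$ is the set of all elements that appear in
infinitely many $X_n$.) Consequently $M_L(\mathcal{T}_L) = 1$.

We have shown that $\varphi$ identifies $L$ on all texts except on a measure zero set.
Consequently it identifies $L$ from random text. □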
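
The summability step used with the Borel-Cantelli Lemma can be checked numerically. The sketch below assumes only that a fixed word has probability $p > 0$ in each independent draw, so that it is missing from the first $n$ draws with probability $(1 - p)^n$; the function names and the value of $p$ are illustrative.

```python
from fractions import Fraction


def missing_probability(p: Fraction, n: int) -> Fraction:
    """Probability that a word of probability p is absent from the first n draws."""
    return (1 - p) ** n


def partial_sum(p: Fraction, N: int) -> Fraction:
    """Sum of the missing probabilities for n = 1..N (a geometric series)."""
    return sum((missing_probability(p, n) for n in range(1, N + 1)), Fraction(0))


p = Fraction(1, 4)
limit = (1 - p) / p          # value of the infinite geometric series
```

The partial sums stay below the finite limit $(1 - p)/p$, which is exactly the condition the lemma needs in order to conclude that the word is missing from only finitely many prefixes with probability one.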

The difficulty in the next theorem is to find a learner that fails on no text, if 
we only know that a learner exists that fails only on a set of measure zero. 

Theorem 2 $[\mathcal{F}] = [[\mathcal{F}]]$.

Proof. The $\subseteq$-part is a special case of Theorem 1.

To prove the opposite direction let $\varphi \in \mathcal{F}$ identify a class of languages $\mathcal{L}$ from
random text. We can assume w.l.o.g. that $\varphi(\sigma) = \varphi(\tau)$ whenever $W_{\varphi(\sigma)} = W_{\varphi(\tau)}$.
We construct a $\psi \in \mathcal{F}$ that identifies $\mathcal{L}$.

The purpose of $\psi$ is to simulate $\varphi$ on a random text constructed
from $t$ with a particular probability distribution. The following is a detailed
definition of $\psi$:

Let $\sigma = (n_1, n_2, \ldots, n_m)$ be some sequence. Here the measure $m_\sigma$ is defined
as $m_\sigma(n_i) = [...]$ for $i = 1, \ldots, m$ and $m_\sigma(n) = 0$ if $n \notin \mathrm{rng}(\sigma)$. In this
way we obtain a sequence of measures. There is a limit distribution $m_t(i) =
\lim_{n \to \infty} m_{\bar{t}_n}(i)$ that is a probability measure. The corresponding probability
measure on texts is $M_t$, which is defined as $M_t(B^L_\sigma) = \prod_i m_t(\sigma_i)$.

We know that lim„^oo = L] = 1 if t is a text for L from Theoremd 

because Mt is admissible for L. Let us fix the language L and a text t for L. We 
can then find for every |>e>0aniVGN, so that = L] > 1 — e for 

all n > N. To compute ip{in) it suffices to look at all (p{cr) where lh((r) = n and 
mt{a) > 0. There must be an z G { cp{a) \ Mt{Bjf) > 0 and lh((r) = zz}, so that 

Y,Mt{B!f)>l-e 



(7 
where the sum is taken over all $\sigma$ such that $M_t(B^L_\sigma) > 0$, $\mathrm{lh}(\sigma) = n$, and

(j){a) = i. Moreover, Wi = L. We cannot define ip{tn) as that i because mt is 
not known. However, and rrit are very similar. If cr contains no j ^ rng(t„), 
then clearly The Mj-measure of all B^ with Ih(cr) = n and 

rng(cr) % rng(t„) is at most n2“". Hence, 

(7 G G 

for sufficiently large $n$ (such that $n 2^{-n} < \varepsilon/2$). Consequently, we can define $\psi(\bar{t}_n)$
as that $i$ for which the sum of all $M_{\bar{t}_n}(B^L_\sigma)$ over all $\sigma$ with $\mathrm{rng}(\sigma) \subseteq \mathrm{rng}(\bar{t}_n)$,
$\mathrm{lh}(\sigma) = n$, and $\varphi(\sigma) = i$ is bigger than $1/2$. If such an $i$ does not exist, then
$\psi(\bar{t}_n)$ is undefined. It is not hard to see that $\psi$ identifies all $L \in \mathcal{L}$. □

The same is valid for recursive learners: 

Theorem 3 $[\mathcal{F}^{rec}] = [[\mathcal{F}^{rec}]]$.

Proof. The construction in Theorem 2 of $\psi(\bar{t}_n)$ is not effective (this refers to
the “w.l.o.g.” section). Let $\varphi \in \mathcal{F}^{rec}$ identify $\mathcal{L}$ from random text. We construct
a different $\psi \in \mathcal{F}^{rec}$ that also identifies $\mathcal{L}$ from random text, but that has an
additional property: If $\psi$ identifies a language $L$ from random text then there is a
single index $i$ for each admissible $M_L$ making $M_L[\psi(T) = i] = 1$ and $W_i = L$. In
order to define $\psi$ we decompose $T$ into a countable number of disjoint texts $T^k$,
for example by $T^k_n = T_{\langle k,n \rangle}$, where $\langle k,n \rangle = 2^k 3^n$. Then $\psi(\bar{T}_{\langle k,n \rangle}) = \min\{\, \varphi(\bar{T}^{\ell}_n) \mid
1 \leq \ell \leq k \,\}$. (If $m$ cannot be written as $\langle k,n \rangle$ then $\psi(\bar{T}_m) = \psi(\bar{T}_{\langle k,n \rangle})$ where
$\langle k,n \rangle$ is the next smaller number of this form.) For each $i$ with $M_L[\varphi(T) = i] > 0$
the probability that $\varphi(T^k) = i$ for some $k$ is one. Therefore with probability one
$\psi(T) = i$ where $i$ is the minimal number with $M_L[\varphi(T) = i] > 0$.

If we use $\psi$ instead of $\varphi$ in the construction of Theorem 2 we can avoid
deciding whether $W_i = W_j$. Then the construction becomes effective. □
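
The decomposition of one text into countably many disjoint sub-texts can be sketched on finite prefixes. The code below is a hedged illustration, assuming 1-based positions, the pairing $\langle k,n \rangle = 2^k 3^n$ with $k, n \geq 1$, and learners given as Python functions on tuples; the names `pair`, `subtext_prefix` and `psi` are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def pair(k: int, n: int) -> int:
    """Position of the n-th element of the k-th sub-text inside the full text."""
    return 2 ** k * 3 ** n


def subtext_prefix(prefix: Sequence[int], k: int, n: int) -> Tuple[int, ...]:
    """First n elements of the k-th sub-text, read off a prefix of the full text."""
    return tuple(prefix[pair(k, i) - 1] for i in range(1, n + 1))


def psi(phi: Callable[[Tuple[int, ...]], int], prefix: Sequence[int], k: int, n: int) -> int:
    """Minimum of phi over the length-n prefixes of sub-texts 1..k.

    Requires a prefix of length at least pair(k, n), which already contains
    the first n elements of every sub-text with index at most k.
    """
    if len(prefix) < pair(k, n):
        raise ValueError("prefix too short for the requested sub-texts")
    return min(phi(subtext_prefix(prefix, l, n)) for l in range(1, k + 1))


# A prefix whose i-th entry equals i makes the selected positions visible.
prefix = tuple(range(1, 73))
positions = {pair(k, n) for k in range(1, 6) for n in range(1, 6)}
```

Distinct pairs yield distinct positions by unique prime factorization, so the sub-texts never share a position of the original text.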

The remainder of this section deals with three strategies: set-driven, memory
limited, and confident learners. If $[\mathcal{S}] = [\mathcal{F}]$, then $[[\mathcal{S}]] = [[\mathcal{F}]]$ (a simple corollary
of the above theorem). If, however, $[\mathcal{S}] \subset [\mathcal{F}]$, then both $[[\mathcal{S}]] = [[\mathcal{F}]]$ and $[[\mathcal{S}]] \subset [[\mathcal{F}]]$
remain possible (the same applies for $\mathcal{F}^{rec}$ instead of $\mathcal{F}$). It is known that
$[\mathcal{F}^{rec} \cap \mathcal{F}^{set\text{-}driven}] \subset [\mathcal{F}^{rec}]$ [?]. A learner is set-driven if each hypothesis depends
only on the set of examples so far seen, but not on their order or multiplicity.
The next theorem implies $[[\mathcal{F}^{set\text{-}driven}]] = [\mathcal{F}^{set\text{-}driven}]$, i.e.,
both models are equivalent for set-driven learners. If a learner is set-driven he
cannot fail on a single text when learning from random text. Surprisingly, this is
not true for rearrangement-independent learners, whose hypotheses depend only
on the set of examples so far seen and on their number.
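
The difference between set-driven and rearrangement-independent learners can be probed on finite samples. The following Python sketch is only a finite sanity check under illustrative assumptions (learners are functions on tuples, and the three example learners are hypothetical); passing the check on a finite family of sequences is necessary but not sufficient for the property to hold in general.

```python
from itertools import product
from typing import Callable, Hashable, Iterable, Tuple

Learner = Callable[[Tuple[int, ...]], int]


def depends_only_on(phi: Learner, seqs: Iterable[Tuple[int, ...]],
                    key: Callable[[Tuple[int, ...]], Hashable]) -> bool:
    """True if phi gives equal outputs on all listed sequences sharing the same key."""
    seen = {}
    for s in seqs:
        out = phi(s)
        if seen.setdefault(key(s), out) != out:
            return False
    return True


def is_set_driven_on(phi: Learner, seqs) -> bool:
    return depends_only_on(phi, seqs, lambda s: frozenset(s))


def is_rearrangement_independent_on(phi: Learner, seqs) -> bool:
    return depends_only_on(phi, seqs, lambda s: (frozenset(s), len(s)))


seqs = [s for n in (1, 2, 3) for s in product((1, 2, 3), repeat=n)]

phi_max = lambda s: max(s)             # depends on the set only
phi_len = lambda s: max(s) + len(s)    # depends on the set and the number of examples
phi_first = lambda s: s[0]             # depends on the order
```

Here `phi_len` passes the rearrangement-independence check but fails the set-driven check, since repeating an example changes its output.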

Theorem 4 Let $L \in RE$ be a language that $\varphi \in \mathcal{F}^{set\text{-}driven}$ identifies from
random text. Then $\varphi$ identifies $L$.
Proof. If $L$ is finite then $\varphi$ locks as soon as it has seen all words from $L$. Otherwise
it would converge on no text at all to the right hypothesis. Hence, it converges
on all texts.

If $L$ is infinite, then take an arbitrary text $t = (n_1, n_2, n_3, \ldots)$ for $L$. We
can assume that all $n_i$ are pairwise distinct, since $\varphi$ is set-driven. Define $M =
\{\, \{n_1, n_2, \ldots, n_m\} \mid m \in \mathbb{N} \,\}$ and $S = \{\, t \in \mathcal{T} \mid \mathrm{rng}(\bar{t}_n) \in M \text{ for all } n \in \mathbb{N} \,\}$.
Then there is a measure $m_L$ such that $M_L(S) > 0$, e.g., $m_L(n_k) = c\,e^{-[...]}$ with
$c = [...]$ (a short computation shows $M_L(S) > 1/2$). Since $\varphi$ is set-driven
it converges on all texts from $S$ to the same hypothesis or it diverges
on all of them. The latter cannot happen, since $S$ does not have a measure of
zero. Therefore $\varphi$ converges to the correct hypothesis on all of $S$. Since $t \in S$, $\varphi$
identifies $t$. □



Corollary 1 $[\mathcal{S}] = [[\mathcal{S}]]$ for all $\mathcal{S} \subseteq \mathcal{F}^{set\text{-}driven}$.

The same result does not hold for rearrangement-independent learners which 
shows that the number of examples plays a crucial role in this context. 

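Theorem 5 There is an $\mathcal{S} \subseteq \mathcal{F}^{rearrangement\text{-}independent}$ with $[[\mathcal{S}]] \neq [\mathcal{S}]$.

Proof. Let $\mathcal{L} = \{\{1\}, \mathbb{N}\}$ and $\varphi$ defined via

$\varphi(\sigma) = i$ if [...] $\mathrm{rng}(\sigma) = \{1, 2, 3, \ldots, \mathrm{lh}(\sigma)\}$,
$\varphi(\sigma) = j$ otherwise.

Obviously, $\varphi$ is rearrangement independent (but not set-driven). While $\varphi$ identifies
$\mathcal{L}$ from random text, it does not identify $\mathcal{L}$. Now choose $\mathcal{S} = \{\varphi\}$.

It is easy to see that $\varphi$ identifies $\mathbb{N}$ from random text since $\varphi$ identifies
every text for $\mathbb{N}$ unless $t_i \neq t_1$ for all $i > 1$. This happens with probability
$\lim_{n \to \infty} (1 - m_L(1))^n = 0$. □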

Memory limited learners are an example of a natural strategy causing no 
restriction on learning from random text, but causing a restriction on learning 
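in the limit. A learner $\varphi$ is memory limited if there is a number $n$ such that $\varphi(\sigma)$
depends only on $\varphi(\sigma_1 \sigma_2 \ldots \sigma_{\mathrm{lh}(\sigma)-1})$ and $\sigma_{\mathrm{lh}(\sigma)}, \sigma_{\mathrm{lh}(\sigma)-1}, \ldots, \sigma_{\mathrm{lh}(\sigma)-n}$ for all
sequences $\sigma$. A memory limited learner remembers only his last hypothesis
and the last $n$ examples.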
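
The shape of a memory limited learner can be sketched as a fold over the input. This is a minimal illustration with hypothetical names, assuming hypotheses are plain integers: the update function sees only the previous hypothesis and a window holding the current example together with the $n$ examples before it.

```python
from typing import Callable, Sequence, Tuple

Update = Callable[[int, Tuple[int, ...]], int]


def run_memory_limited(update: Update, n: int, sigma: Sequence[int], initial: int = 0) -> int:
    """Hypothesis after reading sigma, using only the last hypothesis and a short window."""
    hyp = initial
    for pos in range(len(sigma)):
        window = tuple(sigma[max(0, pos - n): pos + 1])  # current example plus n predecessors
        hyp = update(hyp, window)
    return hyp


def max_update(hyp: int, window: Tuple[int, ...]) -> int:
    """Window size n = 0 suffices: keep the largest word seen so far."""
    return max(hyp, window[-1])


def repeat_update(hyp: int, window: Tuple[int, ...]) -> int:
    """Window size n = 1: count positions whose example repeats its predecessor."""
    return hyp + 1 if len(window) == 2 and window[0] == window[1] else hyp
```

By construction the value on a sequence depends only on the value on the sequence without its last element and on the last $n + 1$ examples, which matches the definition above.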

Theorem 6 $[[\mathcal{F}^{memory\ limited}]] = [[\mathcal{F}]]$.

Proof. First we show that every random text is a fat text (a text that contains
everything infinitely often) with probability one. Let $f : \mathbb{N} \to \mathbb{N} \times \mathbb{N}$ be an
arbitrary bijective function. Then $T^{(i)} = (T_{f^{-1}(i,1)}, T_{f^{-1}(i,2)}, T_{f^{-1}(i,3)}, \ldots)$ is a random
text for every $i \in \mathbb{N}$. We have already shown that every random text is a text
with probability one. Hence, $T$ contains an infinite number of texts $T^{(i)}$ with
probability one.

Let $\mathcal{L} \subseteq RE$ and let $\varphi$ identify $\mathcal{L}$ from random text. Then there is a $\psi$
that identifies $\mathcal{L}$ from text (Theorem 2) and we can assume that $\psi$ identifies
$\mathcal{L}$ memory limited from fat text [?]. Since every random text is a fat text with
probability one, $\psi$ identifies $\mathcal{L}$ from random text, too. □

Confidence is an example where old proof techniques partially transfer to
learning from random text. In most cases this is not possible, in particular if
the proofs are based on locking sequence arguments. A learner is confident if he
converges on all texts, even on texts for a language that he does not identify.

Theorem 7 $[[\mathcal{F}^{confident}]] \subset [[\mathcal{F}]]$.

Proof. We can adapt the proof of [...] in Osherson, Stob, and Weinstein [?].
While the basic principle remains the same, the proof for learning from
random text is much more involved, as they use failure on single texts, while we
have to provide sets with non-zero measure. Suppose $\varphi \in \mathcal{F}$ identifies $RE_{fin}$ (the
finite languages) from random text. Let $\sigma^0$ be the shortest sequence of zeros
such that $W_{\varphi(\sigma^0)} = \{0\}$. Such a $\sigma^0$ exists, since $(0, 0, \ldots)$ is the only text for $\{0\}$
and therefore every random text coincides with it with probability 1. Let $\sigma^1$ be
the shortest sequence of zeros and ones such that $W_{\varphi(\sigma^0 \sigma^1)} = \{0, 1\}$. Let $L = \mathbb{N}$
and $M_L$ be admissible. Again, $\sigma^1$ exists because [...] $> 0$
and therefore there are many sequences starting with $\sigma^0$ on which $\varphi$ even identifies
$\{0, 1\}$. Generally we define $\sigma^n$ to be the shortest sequence over $\{1, \ldots, n\}$
such that $W_{\varphi(\sigma^1 \sigma^2 \cdots \sigma^n)} = \{1, 2, \ldots, n\}$. For analogous reasons as above $\sigma^n$ must
exist.

However, $\varphi$ obviously does not converge on $\sigma^0 \sigma^1 \sigma^2 \ldots$ and is therefore not
confident. Since $RE_{fin} \in [[\mathcal{F}]]$ the claim follows. □

4 Locking Sequences 

Locking sequences play a crucial role in many proofs. In general, no locking 
sequences may exist when learning from random text. The following theorems 
give simple characterizations, when locking sequences exist and when not. 

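Lemma 1 $\mathcal{T}_L \cap B^L_\sigma$ and $B^L_\sigma$ are equivalent modulo a meager set in $\mathfrak{T}_L$.

Proof. $\mathcal{T}_L \cap B^L_\sigma = B^L_\sigma - \bigcup_{i \in L} B^{L - \{i\}}_\sigma$, and $B^{L - \{i\}}_\sigma$ is nowhere dense if $i \in L$:
Every non-void open set contains some basic open set $B^L_\tau$, which in turn contains
$B^L_{\tau i}$, which is disjoint from $B^{L - \{i\}}_\sigma$. □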



Theorem 8 Let $\varphi \in \mathcal{F}^{total}$ identify $L$ from random text. Let $M$ be the
measure zero set of texts for $L$ on which $\varphi$ does not identify $L$. Then the following
three statements are equivalent: (1) Every $\sigma$ with $\mathrm{rng}(\sigma) \subseteq L$ is a prefix of a locking
sequence for $L$. (2) $M$ is meager in $\mathfrak{T}_L$. (3) $M$ is nowhere dense in $\mathfrak{T}_L$.

Proof. We show $2 \Rightarrow 1 \Rightarrow 3 \Rightarrow 2$. Fix some admissible $M_L$ and let $\mathrm{rng}(\sigma) \subseteq L$.

Since (j) identifies L on all texts in Tl — M, 

Tl n c U{ ^ 4 ) ^({0) I ^ is stabilized on L and a ^t}\J M. (1) 

Since and Tl n are equivalent modulo a meager set, Tl n B^ is not 
contained in a meager set by the Baire category theorem. In particular, the 
right hand side of m is not meager and since M is meager, some ^({t}) 

cannot be nowhere dense if t is stabilized on L and a t. Since is 

closed this means that 0 ^ Int(F^^({t})) C F^^{{t}). As a non-void closed set 
F^^{{t}) contains some basic open set B^. The corresponding sequence t ^ a 
is a locking sequence for (p on L. 

$1 \Rightarrow 3$: Let us assume that $t$ is an accumulation point of $M$ and let $\sigma \sqsubseteq t$.
Then every open subset of $B^L_\sigma$ is not disjoint from $M$. Hence $\sigma$ is not a
locking sequence. Since we can choose $\sigma$ arbitrarily, no prefix of $t$ is a locking
sequence.

$3 \Rightarrow 2$: Every set that is nowhere dense is meager. □



Theorem 9 Let φ ∈ F^total identify L from random text. Let M be the
measure zero set of texts for L on which φ does not identify L. Then the following
conditions are equivalent: (1) φ has a locking sequence for L. (2) M is not dense
in T_L. (3) M is not comeager in T_L.

Proof. 1 ⇒ 2: Assume M is dense and σ is a locking sequence. This is not
possible since B_τ ∩ M ≠ ∅ for every τ with rng(τ) ⊆ L because M is dense.
Since σ is a locking sequence, however, B_σ ∩ M = ∅.

2 ⇒ 3: If a set is not dense, it is not comeager.

3 ⇒ 1: If M is not comeager, then T_L − M is not meager because T_L is itself
comeager. As before we can argue as follows:

T_L − M ⊆ ⋃{ F_φ^{-1}(t) | t is stabilized on L }

Since T_L − M is not meager some F_φ^{-1}(t) is not nowhere dense and contains as
a closed set some basic open set B_σ and σ is a locking sequence. □

The above two theorems are stated for total functions. The reason is that
the proofs use F_φ, which is a function that maps texts to texts, and not to
"partial texts." It is possible, but very technical, to modify the proofs for partial
functions. A simpler way is the following, from which the same generalization
immediately follows:

Theorem 10 Let φ ∈ F identify ℒ ⊆ RE from random text. Then there is a
ψ ∈ F^total that also identifies ℒ from random text and has the same failure sets
as φ for every L ∈ ℒ. Moreover φ and ψ have the same locking sequences for
each L ∈ ℒ.

Proof. Let W_i ∉ ℒ. Define ψ as

ψ(σ) = φ(σ) if φ(σ) is defined,
       i otherwise.

Obviously, φ and ψ share the same locking sequences. If φ diverges on a text, ψ
diverges, too, or converges to i thus failing to identify an L ∈ ℒ. If φ identifies
some L ∈ ℒ then ψ converges to the same hypothesis. Hence, the failure sets are
identical. □
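
The totalization step in this proof can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a partial learner is modelled as a function that returns None where it is undefined (a genuinely partial function would simply not terminate), and the names totalize, phi and fallback_index are ours and do not appear in the paper.

```python
from typing import Callable, Optional, Sequence

# A learner maps a finite sequence of examples to a hypothesis index.
# Assumption: "undefined" is modelled by returning None.
Learner = Callable[[Sequence[int]], Optional[int]]


def totalize(phi: Learner, fallback_index: int) -> Callable[[Sequence[int]], int]:
    """Return a total learner psi that agrees with phi wherever phi is defined.

    fallback_index plays the role of the index i with W_i outside the class:
    outputting it can never count as a correct identification.
    """
    def psi(sigma: Sequence[int]) -> int:
        guess = phi(sigma)
        return fallback_index if guess is None else guess

    return psi
```

Because psi differs from phi only where phi is undefined, any sequence on which phi has locked onto a correct hypothesis behaves identically under psi.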



Since φ ∈ F and ψ ∈ F^total share conditions (1), (2), and (3) in Theorems 8
and 9, the three conditions are also equivalent for a partial function φ ∈ F.

An example of how to use these characterizations: Can a learner fail on t iff t
consists only of 1's with finitely many exceptions? The answer is no,
as there is obviously no locking sequence. Nevertheless the failure set is dense.
Another application is the following theorem, which guarantees that confident
learners have locking sequences. Confident learners can still have non-void
failure sets, but these cannot be dense.

Theorem 11 Every confident φ ∈ F that identifies a language L from random
text has a locking sequence for L.

Proof. Since φ is confident it converges on every text. Therefore

T_L ⊆ ⋃{ F_φ^{-1}(t) | t is stabilized }

and again because of Baire's category theorem some F_φ^{-1}(t) is not nowhere dense
and contains a basic open set B_σ. This σ is a locking sequence. □

This theorem can also be easily generalized such that every sequence is
a prefix of a locking sequence.



5 Tail Bounds 

It can be shown that general learners can learn arbitrarily slowly, i.e., the prob- 
ability that they still fail after n rounds (the convergence indicator) converges 
arbitrarily slowly towards zero (the proof is based on ignoring larger and larger 
segments of the text, which slows down learning). However, the next theorem 
shows that recursive learners cannot learn arbitrarily slowly: Either they con- 
verge “fast” or they cannot learn at all. 

Theorem 12 Let L ∈ RE and let some m_L be fixed. There is then a function
h: N → Q such that f(n) = o(h(n)) for every f that is a convergence indicator
for a φ ∈ F^rec that identifies L with probability one.






Peter Rossmanith 



Proof. We define an oracle O : N × Q+ → Q such that |O(i, ε) − m_L(i)| < ε,
i.e., O(i, ε) is a rational number that is very near to m_L(i), but still remains a
rational number. Moreover, let O(i, ε) = 0 iff m_L(i) = 0, i.e., iff i ∉ L. The oracle
O is not uniquely determined; we just choose some O with these properties.

An oracle Turing machine that has access to O and additionally access to the
halting problem for Turing machines with oracle O can compute a lower bound
for f(n) (the details are omitted).

Hence, whenever φ measure one identifies a language L with convergence
indicator f, then there is a function g with g(n) > f(n). Moreover, this g is
computable by some kind of oracle Turing machine whose oracle depends only
on m_L, but not on φ. Therefore, let h be a function shrinking so slowly towards
zero for n → ∞ that no oracle Turing machine as defined above can compute
a function that shrinks slower (it can be easily constructed by diagonalization).
This h is the claimed function. □

The following theorem states that for each nontrivial learning problem expo- 
nential tail bounds can always be achieved and are also the best bounds possible. 

Theorem 13 If ℒ ∈ [F] (resp. ℒ ∈ [F^rec]) and ℒ contains at least two
languages that are not disjoint, then the convergence indicator for some L ∈ ℒ is
always Ω(1)^n. On the other hand there is always a φ ∈ F (resp. φ ∈ F^rec) that
learns ℒ from random text and whose convergence indicator for L is O(1)^n.

Proof. The lower bound follows from a text (i, i, i, ...) where i is contained
in two languages of ℒ. Each learner fails on at least one of them. The upper
bound follows from rearrangement-independent learners: If L is identified by a
rearrangement-independent learner, its convergence indicator is O(1)^n, since the
learner converges when the examples read contain a locking set. □
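
The upper-bound argument can be made concrete with a small computation. The sketch below assumes a finite locking set whose elements have known probabilities under m_L (the numbers used in any call are hypothetical) and computes the probability that n independent draws still miss at least one element of the set, both exactly by inclusion-exclusion and via the simpler union bound. The function names are ours.

```python
from fractions import Fraction
from itertools import combinations
from typing import Sequence


def miss_probability(probs: Sequence[Fraction], n: int) -> Fraction:
    """Exact probability that n i.i.d. draws miss at least one element of a
    finite set whose elements have the given probabilities."""
    total = Fraction(0)
    for k in range(1, len(probs) + 1):
        for subset in combinations(probs, k):
            # Probability that all elements of this subset are missed n times.
            total += (-1) ** (k + 1) * (1 - sum(subset)) ** n
    return total


def union_bound(probs: Sequence[Fraction], n: int) -> Fraction:
    """Upper bound: sum over elements of the chance of never drawing it."""
    return sum(((1 - p) ** n for p in probs), Fraction(0))
```

Each term of the union bound decays geometrically in n, which is the sense in which a learner that converges as soon as a locking set has been seen obtains an exponentially small tail bound.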

This proof also shows that every rearrangement-independent learner (and
thus every conservative one) automatically has exponentially small tail bounds.
It is already known that this is the case for learners that are simultaneously con- 
servative (or rearrangement-independent) and conservative fUj. Then, however, 
there exists a tight relationship between the tail bounds and the expected learning
time, which is lacking if the learner is not conservative.

6 Conclusion 

A stochastic model is a prerequisite to study the speed of learning in inductive 
inference, which was the main objective to start this line of research. There 
are several stochastic models available, where positive examples are generated 
independently and identically distributed according to a distribution. Kapur and 
Bilardi show how to construct learners in a uniform way jOj . 

If the same learner must identify a language for all reasonable probability 
distributions, then the only languages that are learnable are those that are also 
learnable in Gold's model of learning in the limit [5]. We call the latter stochastic
model learning from random text. This model captures all classes of languages
learnable in Gold's model and none else. However, learners restricted in some
ways can learn more classes from random text than from text. Memory-limited
learners are an example. On the other hand, for many strategies the two models
coincide.

While there exist always locking sequences for Gold-style learners, this is 
not necessarily the case for learning from random text. The existence of locking 
sequences is closely related to the topological properties of the failure sets. 

The general results on the speed of learning are as follows. One problem of
inductive inference is that a learner never knows whether it has already converged
or whether it will have to change its hypothesis somewhere in the future.
Exponentially small tail bounds let the probability of the latter drop very fast, so
exponentially small tail bounds are a useful property of a learner. We have seen
that everything that can be learned at all can also be learned with exponentially 
small tail bounds, but not better. In particular, stochastically finite learning m 
is always possible in principle.

Abstract. The basis of inductive learning is the process of generating
and refuting hypotheses. Natural approaches to this form of learning 
assume that a data item that causes refutation of one hypothesis opens 
the way for the introduction of a new (for now unrefuted) hypothesis, and 
so such data items have attracted the most attention. Data items that 
do not cause refutation of the current hypothesis have until now been 
largely ignored in these processes, but in practical learning situations 
they play the key role of corroborating those hypotheses that they do not 
refute. 

We formalise a version of K.R. Popper’s concept of degree of corroboration 
for inductive inference and utilise it in an inductive learning procedure 
which has the natural behaviour of outputting the most strongly corrob- 
orated (non-refuted) hypothesis at each stage. We demonstrate its utility 
by providing characterisations of several of the commonest identification 
types in the case of learning from text over class-preserving hypothe- 
sis spaces and proving the existence of canonical learning strategies for 
these types. In many cases we believe that these characterisations make 
the relationships between these types clearer than the standard charac- 
terisations. The idea of learning with corroboration therefore provides a 
unifying approach for the field.



Keywords: Degree of Corroboration; Inductive Inference; Philosophy of Science. 



1 Introduction 

The field of machine inductive inference has developed in an ad hoc manner, 
in particular in the characterisations of identification types which have been 
achieved. In this paper we wish to propose a new unifying framework for the 
field based on the philosophical work of K. R. Popper, and in particular his con- 
cept of degree of corroboration. We will demonstrate that many of the existing 
identification types in the case of learning from text allow an alternative char- 
acterisation using the concept of learning with corroboration; in particular this 
approach reveals the existence of canonical learning algorithms for the various 
types. 



O. Watanabe, T. Yokomori (Eds.): ALT’99, LNAI 1720, pp. 145-^221 1999- 
© Springer-Verlag Berlin Heidelberg 1999

We will be concerned with learning indexable recursive families of recursive
languages from text. We restrict our attention to the standard case of class-
preserving hypothesis spaces, i.e. those indexed recursive families H_1, H_2, ... for
C such that for every L ∈ C there exists at least one (and possibly many) i such
that H_i describes L and every H_i describes some L in C.

We assume the standard definitions in the field of machine inductive infer- 
ence [Go67, An80, AS83]. Definitions of other concepts used in this paper may 
be found as follows: strong monotonic learning (SMON-TXT) [Ja91]; refuting
inductive inference machines (RIIMs) [MA93, LW94]; justified refuting learning
(JREF-TXT) [LW94]; set-driven learning (s-*-TXT) [WC80, LZ94].

Our notation will mostly be standard. We mention the following points. IN
will be the natural numbers 0, 1, 2, ... while IN+ will be the positive integers
1, 2, 3, .... Our languages L will be non-empty sets of words over a fixed finite
alphabet Σ; therefore L ⊆ Σ*. We will write (Σ*) for the space of all finite
and infinite sequences from Σ*; therefore if t is a text for language L, we have
t ∈ (Σ*). We write t_m for the finite initial subsequence of t of length m + 1,
and t_m^+ for the content of t_m, i.e. if t = s_0, s_1, s_2, ... then t_m^+ = {s_i | i ≤ m}.
Index(C) will be the set of all class-preserving recursive indexings ℒ of class C of
recursive languages; such indexings will be our hypothesis spaces. If hypothesis
H ∈ ℒ describes language L ∈ C, where ℒ ∈ Index(C), then we abuse notation
slightly by writing H = L. Similarly if H_1, H_2 ∈ ℒ describe the same L ∈ C we
write H_1 = H_2. We say that t_m refutes H iff t_m^+ ⊈ H. The set of all texts for
H will be written Txt(H) while Txts(ℒ) will be the set of all texts t such that
(∃H ∈ ℒ) t ∈ Txt(H).

2 Degree of Corroboration 

2.1 Popper’s ‘Logic of Scientific Discovery’ 

The philosopher K.R. Popper [Po34, Po54, Po57, Po63] defined a philosophical 
and logical system covering the epistemology and practice of science. A central 
plank of this system was the concept of degree of corroboration, C(x, y), meaning
the degree to which a theory x receives support from the available evidence
y. Evidence supporting x causes C(x, y) to increase in value, while evidence
undermining x causes C(x, y) to decrease. A set of ten desiderata [Po34, Po57]
defined C(x, y). Space precludes a full discussion of these desiderata here; the
reader is referred to [Wa99]. 

Popper’s degree of corroboration is a practical measure enabling us to choose 
between unrefuted theories, given that a scientific theory is, by its very nature, 
incapable of proof (another major strand of Popper's work was concerned with
settling this point). We should tentatively believe the best-corroborated hypothesis
at any given time. 

An interesting recent discussion of Popper’s work is to be found in Gillies 
[Gi93, Gi96]. There a logical system is characterised as one with both inferential 
and control elements; in both Popper's work and the present paper corroboration
plays the role of control. Indeed given the characterisations using canonical
learners which we have obtained for standard learning types (Section 4) we may
say that degree of corroboration is the only control element necessary in machine 
inductive inference. A more detailed discussion of Gillies’s work may be found 
in [Wa99]. 

2.2 Our Differences from Popper’s Approach - Discussion 

Restricted Domain We wish to define a corroboration function analogous to 
Popper’s but for use in the domain of inductive learning theory. This restricted 
domain enables us to make a number of simplifying assumptions compared to 
Popper’s version. 

First we note that we always wish to state how well a hypothesis is corrobo- 
rated by data. This is already more specific than Popper’s approach, in which he 
specifically allows the corroboration of, for example, one theory by another. Our 
hypotheses will be those of an inductive inference machine and will come from 
a particular hypothesis space, within which we aim to find a true description of 
the phenomenon producing the data, which will be a recursive language. The 
data will be a sequence of examples forming a text (or strictly speaking, forming 
at any particular time an initial segment of a text) for the phenomenon. 

c(H, t) will be the degree to which example text t corroborates hypothesis
H. Lower case is used to distinguish our versions of Popper’s functions. 

Fixed Values We assume that data is free of noise, and that we aim to find a 
hypothesis which exactly describes or explains the concept producing the data. 
Now the idea that data undermines (Popper’s choice of word) a theory can be 
replaced by outright refutation in the case that data disagrees with the predic- 
tions of the theory. Thus all the possible negative values in Popper’s scheme may 
be replaced in ours by —1, the corroboration value of refuted hypotheses. 

Similarly the value 0, reserved by Popper for the degree of corroboration 
offered to x by an independent theory y, subtly changes its meaning when we 
restrict ourselves to corroboration of hypotheses by data. The value 0 is now 
the corroboration given to any theory by the empty data set 0, by vacuous data 
which gives us no help in choosing between competing hypotheses in our space, 
or in the case that the theory itself is tautological, metaphysical or otherwise 
not logically refutable. 

References to Probability For historical reasons, Popper's desiderata are
tied closely to definitions in probability; specifically, Popper sets out to
demonstrate that degree of corroboration is in no sense a measure of probability. For
our purposes, we have no need of any directly defined probabilistic measures. In
a powerful argument, Popper identified the maximum degree of corroboration
possible for a hypothesis with its logical improbability, and therefore with its
scientific interest. Similarly, we use c(H) to mean the highest degree of
corroboration of which H is capable; however we drop the reference to P(x) in Popper's
definition of C(x) and instead add some natural restrictions on c(H).

Popper's dependence on probabilistic definitions leads him to restrict the
maximum degree of corroboration in any case to the value 1. Objections to this
unnecessary restriction led him to drop it in [Po57], and we do likewise. Further,
we may drop the restriction of degrees of corroboration to real number values
altogether, and use any partially ordered set S with a minimum element −1 such
that S − {−1} has a minimum element 0, and decidable (recursive) relations >, <
and CXI. 

2.3 Our Definition of Degree of Corroboration 

Let H range over hypotheses from our space ℒ, and t over texts and finite initial
segments of texts. We assume that c(H, t) ranges over some partially ordered set
S with minimum element −1 and an element 0 minimal in S − {−1}. We write
c(H) for the maximum degree of corroboration possible for H. Falsifiers(H) is the
set of potential data items in Σ* which refute H. If Falsifiers(H_i) ⊆ Falsifiers(H_j)
then we will write H_j ⊆ H_i to capture the natural Popperian sense that H_j is
more easily refuted (potentially more strongly corroborable) than H_i.

Our model of learning requires that c(H, t) and comparison (<) between
degrees of corroboration are both recursive, but not necessarily that c(H) is
recursive or that c(H_i) < c(H_j) is decidable.

First we formally define our corroboration functions. 

Definition 1. A corroboration function c : ℒ × (Σ*) → S over ℒ maps
hypotheses and texts to some set S with minimum element −1 and an element 0
minimal in S − {−1} such that S has a decidable partial ordering <, and satisfies
the following desiderata for all hypotheses H, H' ∈ ℒ and all texts t, t' ∈ (Σ*):

1. c(H, t) = −1 iff there exists data in t which refutes H.

2. c(H, t) ≥ 0 iff t does not refute H.

3. c(H, t) = 0 if t is empty or contains no data capable of refutation of any
hypothesis in our space.

4. c(H) = max{Lim_{n→∞} c(H, t_n) | t is a text for H} is uniquely defined.

5. c(H) ≥ c(H') if H ⊆ H'.

6. If t is a finite initial subsequence of t' then either c(H, t) ≤ c(H, t') or
c(H, t') = −1.
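
Desiderata 1, 2, 3 and 6 only concern finite data, so for a finite toy hypothesis space they can be checked exhaustively. The following Python sketch does this for a deliberately simple candidate function that counts the distinct "informative" items seen (items missing from at least one hypothesis). Everything here is an assumption made for illustration: the toy function, the finite hypothesis space and all names are ours, numeric values stand in for the ordered set S, and desiderata 4 and 5, which involve limits over infinite texts, are not checked and are not claimed to hold.

```python
from itertools import product


def refutes(prefix, language):
    """A finite sequence refutes a language iff its content is not a subset."""
    return not set(prefix) <= language


def make_toy_c(hypotheses):
    """hypotheses: dict mapping index -> frozenset of data items."""
    def informative(item):
        # An item can refute something iff some hypothesis lacks it.
        return any(item not in h for h in hypotheses.values())

    def c(i, prefix):
        if refutes(prefix, hypotheses[i]):
            return -1
        return sum(1 for item in set(prefix) if informative(item))

    return c


def check_finite_desiderata(c, hypotheses, alphabet, max_len):
    """Exhaustively check desiderata 1, 2, 3 and 6 on all prefixes up to max_len."""
    for n in range(max_len + 1):
        for prefix in product(alphabet, repeat=n):
            vacuous = all(not refutes(prefix, g) for g in hypotheses.values())
            for i, h in hypotheses.items():
                value = c(i, prefix)
                refuted = refutes(prefix, h)
                if (value == -1) != refuted:          # desideratum 1
                    return False
                if (value >= 0) == refuted:           # desideratum 2
                    return False
                if vacuous and value != 0:            # desideratum 3
                    return False
                for k in range(n):                    # desideratum 6
                    if value != -1 and c(i, prefix[:k]) > value:
                        return False
    return True
```

A function that ignores refutation altogether, such as one returning 0 everywhere, fails the check at the first refuting prefix.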

Note that item 5 in the definition implies that if H = H' then c(H) = c(H').
Our definition of degree of corroboration is simpler than Popper’s because 
we have dropped all reference to probability and this gives us greater freedom 
when actually assigning values to our functions c{H) and c{H,t). We will see in 
the next section that certain inductive learning identification criteria will require 
corroboration functions with additional properties to those specified above. 

3 Learning with Corroboration 

In this section we cover the remaining assumptions and definitions necessary to 
define a theory of inductive learning with corroboration.

3.1 Hypotheses and Hypothesis Spaces

All forms of inductive inference suffer from the problem that the learner is re- 
quired to choose one from among (typically) infinitely many hypotheses at each 
stage. Clearly no learner can consider all these hypotheses before it outputs a 
hypothesis or requests further data, so in effect there are only a limited number 
of hypotheses in play at any given time. Most authors gloss over this question 
as a matter of detail, or deal with it implicitly, but as we intend to propose a 
new unifying model for machine inductive inference, we feel constrained to deal 
with it explicitly. 

We therefore assume that along with our hypothesis space H_1, H_2, ... we have
a recursive, monotonically increasing function p : IN → IN with Lim_{n→∞} p(n) =
∞ which gives the number of hypotheses in play at stage n of any learning
procedure with this hypothesis space. This leads to one slight concession with
respect to our desiderata: hypotheses H_j which are not yet in play at stage n need
not be considered to be either refuted or corroborated by t_n, the examples seen
to that stage; we therefore arbitrarily assign c(H_j, t_n) = 0 for such n, j. This
cannot cause confusion as these hypotheses are (by definition) not considered by
any algorithm; it serves only to simplify some algorithms defined in the proofs.



3.2 Corroboration Functions and Canonical Learners with 
Corroboration 

In the following section (Section 4) we examine the use of corroboration in
inductive learning and prove that many of the most natural inductive learning
identification types can be characterised by an existence condition for a suitable 
corroboration function over the hypothesis space. Our intention is that this cor- 
roboration function (which is invariably recursive so no undecidability results 
are implied, nor is any additional computing power gained illicitly) will be used 
as an oracle by a canonical learner for the appropriate type; this demonstrates 
that there is effectively a single best learning strategy for each identification 
type, and only the details of the corroboration function change depending on 
the hypothesis space. 

The behaviour of a learner with corroboration is defined as follows. 

Definition 2. A Turing machine M with oracle c(H, t) is called a learner with
corroboration if c(H, t) is a recursive corroboration function and on input t with
hypotheses H_1, ..., H_p in play, M outputs some i ≤ p such that c(H_i, t) ≥ 0
is maximal among the c(H_j, t), j = 1, ..., p, if defined, and requests more input
otherwise.

If additionally M learns within identification type *, we call M a *-learner
with corroboration.
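
One step of such a learner can be sketched as follows. This is a minimal Python illustration, not the paper's formal machine: it assumes that corroboration values are totally ordered numbers (the definition allows any decidable partial order), breaks ties by choosing the least index, and uses a toy oracle over three finite languages that is a placeholder and is not claimed to satisfy every desideratum of a corroboration function. All names are ours.

```python
from fractions import Fraction
from typing import Callable, Hashable, Optional, Sequence

Oracle = Callable[[int, Sequence[Hashable]], Fraction]


def learner_step(c: Oracle, prefix: Sequence[Hashable], in_play: int) -> Optional[int]:
    """Return the least index i <= in_play whose corroboration on prefix is
    non-negative and maximal, or None to request more input."""
    scores = {i: c(i, prefix) for i in range(1, in_play + 1)}
    unrefuted = {i: s for i, s in scores.items() if s >= 0}
    if not unrefuted:
        return None
    best = max(unrefuted.values())
    return min(i for i, s in unrefuted.items() if s == best)


# Hypothetical toy hypothesis space and oracle, for illustration only.
HYPOTHESES = {1: frozenset({0, 1, 2, 3}), 2: frozenset({0, 1}), 3: frozenset({0})}


def toy_oracle(i: int, prefix: Sequence[Hashable]) -> Fraction:
    seen = set(prefix)
    if not seen <= HYPOTHESES[i]:
        return Fraction(-1)                      # refuted
    # Smaller languages score higher once the same data has been seen.
    return Fraction(len(seen), len(HYPOTHESES[i]))
```

The learner itself never inspects the hypotheses; it only compares oracle values, which is what allows a single learner to be reused across hypothesis spaces.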

Clearly such a learner is consistent with Popper’s dictum that we should 
prefer the most strongly corroborated hypothesis among competing hypotheses.

4 Characterising TXT-Identification Types in Learning
with Corroboration

In this section we are concerned only with learning from text, and often abbrevi- 
ate the names of identification types by dropping the -TXT. Our learners always 
work with respect to class-preserving hypothesis spaces. 

Lack of space precludes the inclusion of most proofs. The proofs of the
Theorems follow the form of the proof of Theorem 1 with additional details for the
more complex learning types. The Corollaries concerning canonical learners rely
on the observation that in each case the learner defined in the (⇐) part of the
proof of the preceding Theorem depends on ℒ only via c. All proofs may be
found in [Wa99].

4.1 LIM- and s-LIM-Learning 

Definition 3. A corroboration function c over ℒ is called limiting iff

(∀H ∈ ℒ)(∀t ∈ Txt(H))(∃i)[H_i = H ∧
(∃n)(∀m > n)(∀j)[c(H_i, t_m) > c(H_j, t_m) ∨ [c(H_i, t_m) ≮ c(H_j, t_m) ∧ i ≤ j]]]

Theorem 1. C ∈ LIM-TXT iff there exists ℒ ∈ Index(C) such that there is a
recursive limiting corroboration function c over ℒ.

Proof. (⇐)

We define a learner M which uses such a recursive limiting c to LIM-learn
any H ∈ ℒ.

Let t be a text. Let the hypotheses in play at stage m be H_1, ..., H_p. At the
(m + 1)th stage (i.e. on input t_m) M outputs min(Best_m) if Best_m is non-empty
and requests more input otherwise, where

Best_m = {i | i ≤ p ∧ c(H_i, t_m) ≥ 0 ∧ (∀j ≤ p) c(H_i, t_m) ≮ c(H_j, t_m)}

M is recursive: M recursively computes c(H_i, t_m) for i = 1, ..., p and forms
the finite set of those i for which c(H_i, t_m) is maximal under the recursive relation
<. M now outputs the minimum such i, unless the set is empty, in which case
it requests more input.

On presentation of a text t for H, M converges to some j such that H_j = H:
fix t, an arbitrary text for H. Let n be that stage defined in Definition 3. Now
there is some j with H_j = H such that at stage n and all subsequent stages m
M will output j because j = min(Best_m) by assumption that c is a limiting
corroboration function and the definition of M.

(⇒)

Suppose M is an inductive learning machine which LIM-learns C w.r.t. ℒ. We
define a recursive c which produces values (for degree of corroboration) ranging
over IN ∪ {−1}. Let

c(H_j, t_m) = −1     if t_m refutes H_j
              m + 1  if M(t_m) = j
              m      otherwise

c is recursive: it is decidable for any j whether t_m refutes H_j, and by
assumption M is an IIM.

c is a limiting corroboration function over ℒ: it is easily checked that c
satisfies the conditions of Definition 1 and so c is a corroboration function.

Let t be any text for H ∈ ℒ. By assumption there exists an index j such
that H_j = H and a stage n after which M always outputs j. Therefore at all
stages m > n we have c(H_j, t_m) > c(H_k, t_m) for all k ≠ j, which satisfies the
requirements of Definition 3.
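
The construction of c from a learner in this direction of the proof is short enough to write out. The Python sketch below is an illustration under our own assumptions: hypotheses are given as a dictionary of finite sets, the learner is any function from a finite sequence to an index, and the empty sequence (which does not occur as some t_m in the paper, since t_m has length m + 1) is mapped to 0. The names are ours.

```python
def corroboration_from_learner(learner, hypotheses):
    """Build c(j, t_m): -1 if t_m refutes H_j, m + 1 if the learner outputs j
    on t_m, and m otherwise."""
    def c(j, prefix):
        if not set(prefix) <= hypotheses[j]:
            return -1
        if not prefix:
            return 0          # assumption: empty data gives corroboration 0
        m = len(prefix) - 1   # prefix is t_m, which has length m + 1
        return m + 1 if learner(prefix) == j else m

    return c
```

Once the learner has converged to j, the value for H_j exceeds the value of every other hypothesis by exactly one at every later stage, which is the strict inequality used at the end of the proof.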



Corollary 1. If C ∈ LIM-TXT then there exists ℒ ∈ Index(C) such that there
is a recursive limiting corroboration function c over ℒ with the property that

(∀H ∈ ℒ)(∀t ∈ Txt(H))(∃i)[H_i = H ∧ (∃n)(∀m > n)(∀j ≠ i) c(H_i, t_m) > c(H_j, t_m)]



Corollary 2. There is a canonical LIM-learner with corroboration which will
learn any C ∈ LIM-TXT w.r.t. any ℒ ∈ Index(C) using any recursive limiting
corroboration function c over ℒ as an oracle.

When considering the philosophical background for our model of learning, it 
seems clear that the order in which examples are presented to the learner, or the 
number of times the same example is repeated, has no significance. This leads 
us to the following definition. 

Definition 4. A corroboration function c over ℒ = H_1, H_2, ... is called natural
if on all texts t, u, for all m, n and all i we have t_m^+ = u_n^+ ⇒ c(H_i, t_m) = c(H_i, u_n).
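
For a finite alphabet and bounded length, naturalness can be tested by brute force: group all finite sequences by their content and check that c assigns one value per group. The Python sketch below does exactly that; it is only an illustration, with names of our choosing, and a passing check on short sequences is of course not a proof for all texts.

```python
from itertools import product


def is_natural(c, indices, alphabet, max_len):
    """Check that c(i, prefix) depends only on the content of prefix, for all
    prefixes over alphabet of length at most max_len."""
    seen = {}
    for n in range(max_len + 1):
        for prefix in product(alphabet, repeat=n):
            for i in indices:
                key = (i, frozenset(prefix))
                value = c(i, prefix)
                # The first value recorded for this content must be repeated.
                if seen.setdefault(key, value) != value:
                    return False
    return True
```

A function that counts distinct items passes, while one that depends on the length of the data seen so far does not.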

It might be objected that corroboration functions lacking the naturalness 
property should be disallowed. However, they are no more unnatural than
non-set-driven learners (it is known [LZ94] that s-LIM-TXT ⊂ LIM-TXT).

Theorem 2. C ∈ s-LIM-TXT iff there exists ℒ ∈ Index(C) such that there
exists a recursive natural limiting corroboration function c over ℒ.



Corollary 3. There is a canonical s-LIM-learner with corroboration which will
learn any C ∈ s-LIM-TXT w.r.t. any ℒ ∈ Index(C) using any recursive natural
limiting corroboration function c over ℒ as an oracle.

4.2 Conservative and Strong Monotonic Learning

Definition 5. A corroboration function c : ℒ × (Σ*) → S over ℒ is called
attaining if

(∀H ∈ ℒ)(∀t ∈ Txt(H))[(∃j)(∃n)[H_j = H ∧ c(H_j, t_n) = c(H_j)] ∧
(∀i)(∀m)[c(H_i, t_m) = c(H_i) ⇒

fiH' e C)[[tm refutes H' V Hi :f> H'] A c{H' ,tm) if- c{Hi,tm)]] 



c is a recursive attaining corroboration function if both c and c_f : ℒ × S → {0, 1}
are total and recursive, where:

c_f(H_i, s) = 1 if s = c(H_i)
              0 otherwise

Note that c(H, ∅) = 0 implies (∀i) c(H_i) ≥ 0.

Theorem 3. C ∈ CONSERV-TXT iff there exists ℒ ∈ Index(C) such that there
exists a recursive attaining corroboration function c over ℒ.



Corollary 4. There is a canonical CONSERV-learner with corroboration which
will learn any C ∈ CONSERV-TXT w.r.t. any ℒ ∈ Index(C) using as an oracle
any recursive attaining corroboration function c over ℒ.



Definition 6. A corroboration function c(H, t) over ℒ = H_1, H_2, ... is called
strict if

(∀H_i ∈ ℒ)(∀t ∈ Txt(H_i))(∀n)[c(H_i, t_n) = c(H_i) ⇒ (∀H_j ⊇ t_n^+) H_j ⊇ H_i]

c is called a recursive strict corroboration function if both c and c_f are total
and recursive, where c_f is as defined in Definition 5.



Theorem 4. C ∈ SMON-TXT iff there exists ℒ ∈ Index(C) such that there
exists a recursive strict attaining corroboration function c over ℒ.



Corollary 5. There exists a canonical SMON-learner with corroboration which
SMON-learns any C ∈ SMON-TXT w.r.t. any ℒ ∈ Index(C) using any recursive
strict attaining corroboration function over ℒ as an oracle.



Corollary 6. There is a canonical (CONSERV ∪ SMON)-learner with
corroboration which will CONSERV-learn any C ∈ CONSERV-TXT w.r.t. any ℒ ∈
Index(C) using any recursive attaining corroboration function c over ℒ as an
oracle and will SMON-learn any C ∈ SMON-TXT w.r.t. any ℒ ∈ Index(C)
using any recursive strict attaining corroboration function c for ℒ as an oracle.

4.3 FIN- and Refuting Learning

Definition 7. Let ℒ = H_1, H_2, ... be a hypothesis space. Then f : (Σ*) × IN →
{0, 1} is called a sufficiency function over ℒ if

(Vf)(Vrn)(Vn)[/(tm,n) = 1 

refutes HjM 

(3i < n)[t+ C K, A {yk)[Hk = Hi W tm refutes Hk]]]] 
and (∀t)(∀j)(∀k > j)(∀n)(∀m > n)[f(t_j, n) = 1 ⇒ f(t_k, m) = 1]

Definition 8. Let f be a sufficiency function over ℒ.

f is called an inner sufficiency function over ℒ if it additionally holds that
for every text t ∈ Txts(ℒ), (∃m, n) f(t_m, n) = 1.

If instead it holds that for every text t ∉ Txts(ℒ), (∃m, n) f(t_m, n) = 1, then f
is called an outer sufficiency function over ℒ.

Naturally the existence of a recursive (inner or outer) sufficiency function over
ℒ is a very strong condition and allows particularly strong forms of learning.

Theorem 5. C ∈ FIN-TXT iff there exists ℒ ∈ Index(C) such that there exists
a recursive inner sufficiency function over ℒ.

Corollary 7. There exists a canonical FIN-learner which FIN-learns any C ∈
FIN-TXT w.r.t. any ℒ ∈ Index(C) using any recursive inner sufficiency function
over ℒ as an oracle.

We may use a sufficiency function to define a particularly strong form of 
corroboration function. 

Definition 9. c(H, t) is called a sufficient corroboration function over ℒ if there
exists an inner sufficiency function f(t, n) over ℒ such that:

(∀t)(∀i)(∀m)[[c(H_i, t_m) ≥ 0 ∧ c(H_i, t_m) = c(H_i)] ⇒ f(t_m, i) = 1]

and

(∀t)(∀m)(∀n)[f(t_m, n) = 1 ⇒ (∃i ≤ n) c(H_i, t_m) = c(H_i)]

c is called a recursive sufficient corroboration function if both c and c_f are
total and recursive, where c_f is as defined in Definition 5.

Theorem 6. C ∈ FIN-TXT iff there exists ℒ ∈ Index(C) such that there exists
a recursive sufficient corroboration function c over ℒ.

Corollary 8. There exists a canonical FIN-learner with corroboration which
FIN-learns any C ∈ FIN-TXT w.r.t. any ℒ ∈ Index(C) using any recursive
sufficient corroboration function over ℒ as an oracle.

Theorem 7. C ∈ JREF-TXT iff there exists ℒ ∈ Index(C) such that there
exists a recursive outer sufficiency function f over ℒ and a recursive limiting
corroboration function c over ℒ.

Corollary 9. There exists a canonical JREF-learner with corroboration which
JREF-learns any C ∈ JREF-TXT w.r.t. any ℒ ∈ Index(C) using any recursive
outer sufficiency function and any recursive limiting corroboration function over
ℒ as oracles.

5 Example 

The corroboration functions constructed in the (⇒) proofs in Section 4 were
simplistic. However in practical use, the existence or non-existence of appropriate
corroboration functions may be suggested naturally by the space of hypotheses
in use. We give an example of the use of corroboration functions to prove the
learnability under certain identification criteria of a simple class.

Our example languages will be sets of points in the rational plane Q², so
Σ = {(a, b) | a, b ∈ Q}.

Example 1. Let C be the set of all closed circles of finite radius. Let <,> be
a fixed recursive bijection between Q² and IN+ and <<,>> a fixed recursive
bijection between Q² and Q. A suitable hypothesis space ℒ = H_1, H_2, ... is given
by

H_<a,b> = {(p, q) | a = <<x, y>> ∧ (p − x)² + (q − y)² ≤ b²}

It is easily seen that ℒ is a class-preserving recursive indexing of C.

Consider the following corroboration function c : ℒ × (Σ*) → Q ∪ {∞},
which is based on the naturalistic idea that the further away a point is from the
centre encoded by a, the more severe a test it is of hypothesis H_<a,b>. For circles
of non-zero radius b we also include a scaling multiplier of 1/b² into the
corroboration function, so that smaller circles are potentially more highly corroborable
than large ones.

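$$c(H_{\langle a,b\rangle}, t_m) = \begin{cases}
0 & \text{if } t_m^+ = \emptyset, \\
-1 & \text{if } t_m \text{ refutes } H_{\langle a,b\rangle}, \text{ i.e. } [a = \langle\langle x,y\rangle\rangle \wedge (\exists (c,d) \in t_m^+)[(c-x)^2 + (d-y)^2 > b^2]], \\
\infty & \text{if } b = 0 \wedge a = \langle\langle x,y\rangle\rangle \wedge t_m^+ = \{(x,y)\}, \\
1/b^2 \cdot max(a,b,t_m) & \text{otherwise,}
\end{cases}$$

where

$$max(a,b,t_m) = \max\{((c-x)^2 + (d-y)^2)/b^2 \mid a = \langle\langle x,y\rangle\rangle \wedge (c,d) \in t_m^+\}.$$

With a little checking we see that c is indeed a corroboration function under
the definition of a corroboration function, and is recursive and natural. c is
limiting because on any text t for $H_i$ we have a stage m at which $t_m$ contains
two diametrically opposed points on the circumference of the circle defined by
$H_i$. Then if we let $i = \langle a,b\rangle$:

- $(\forall j)[[j = \langle c,d\rangle \wedge d^2 < b^2] \Rightarrow [t_m \text{ refutes } H_j \wedge c(H_j,t_m) = -1]]$

- $(\forall j)[[j = \langle c,d\rangle \wedge d^2 > b^2] \Rightarrow (\forall n \ge m)\, c(H_i,t_n) = 1/b^2 > 1/d^2 \ge c(H_j,t_n)]$

- $(\forall j)[[j = \langle c,d\rangle \wedge d^2 = b^2] \Rightarrow [c = a \vee t_m \text{ refutes } H_j]]$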

These are the only cases, so at all stages $n \ge m$ we have that $H_{\langle a,b\rangle}$ is the
most strongly corroborated hypothesis (except for $H_{\langle a,-b\rangle}$, which is equally
strongly corroborated and describes the same circle).
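
The corroboration function of this example can be sketched in a few lines of Python, which makes the four cases easy to check by hand. This is only an illustration under stated assumptions: the hypothesis is passed directly as a centre (x, y) and a radius b rather than through the index encodings, the first case is read as "no data point seen yet", and the function name `corroboration` is ours.

```python
from fractions import Fraction
import math

def corroboration(centre, b, content):
    """Sketch of c(H, t_m) for the closed circle H with the given centre and radius b.

    `content` is the finite set t_m^+ of rational points seen so far.
    """
    x, y = centre
    b2 = Fraction(b) ** 2
    if not content:
        return Fraction(0)          # assumption: no data yet, so no corroboration
    # squared distance of every observed point from the centre
    dist2 = [Fraction(c - x) ** 2 + Fraction(d - y) ** 2 for (c, d) in content]
    if any(s > b2 for s in dist2):
        return Fraction(-1)         # some point lies outside the circle: refuted
    if b2 == 0:
        return math.inf             # degenerate circle, only its centre observed
    return (1 / b2) * (max(dist2) / b2)
```

For the circle of radius 2 around the origin, a point on the circumference such as (2, 0) already yields the maximal value 1/4, while an interior point such as (1, 0) yields only 1/16.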

c is also attaining because

- if $b = 0$ then $(\forall a)\, c(H_{\langle a,0\rangle}) = \infty$

- if $b \ne 0$ then $(\forall a)\, c(H_{\langle a,b\rangle}) = 1/b^2$

and for example $c(H_{\langle a,b\rangle}, t_0) = c(H_{\langle a,b\rangle})$ where $t = (x+b,y), \ldots$ is a text for
$H_{\langle a,b\rangle}$ and $a = \langle\langle x,y\rangle\rangle$.

The above suffices to prove that $\mathcal{C} \in$ s-CONSERV-TXT, by Theorem 0.

Finally we can see that c is not strict because for example (let $b > 0$)
$t = (x+b,y), \ldots$ results in $c(H_{\langle\langle\langle x,y\rangle\rangle,b\rangle}, t_0) = 1/b^2 = c(H_{\langle\langle\langle x,y\rangle\rangle,b\rangle})$
although many hypotheses $H_j$ with $H_{\langle\langle\langle x,y\rangle\rangle,b\rangle} \not\subseteq H_j$ remain unrefuted.
Nevertheless it is possible to find a recursive, strict, attaining, limiting,
set-driven corroboration function over $\mathcal{L}$ by requiring that two diametrically
opposed points on the circumference of $H_i$ must appear in the text before we
set $c(H_i, t_m) = c(H_i)$. This proves that $\mathcal{C} \in$ s-SMON-TXT. The details are left
as an exercise for the reader.

6 Conclusions and Future Work 

We have proposed a unifying model for machine inductive inference based on the 
philosophical work of K.R. Popper, and obtained characterisations of many of the 
standard identification types in learning indexed families of recursive languages 
from text. In our model canonical learners use recursive oracles which compute 
a version of Popper’s degree of corroboration. These learners then follow the 
natural strategy of preferring the most strongly (or at least a maximally strongly) 
corroborated hypothesis at any given time. Membership of a class of concepts 
within a particular identification criterion is then equivalent to the existence 
of a recursive corroboration function with certain properties depending on the 
identification type. 

We intend to extend this unifying model of learning to include language 
learning from informant and related problems such as learning of partial recur- 
sive functions. An extension of our approach to learning from noisy data would 
be particularly interesting; in this case it is no longer certain that a single ad- 
verse data item refutes a hypothesis and we would be obliged to allow negative 
corroboration values other than —1, as in Popper’s original model. Given the 
crucial role played by the hypothesis space in our model, it would also be in- 
teresting to extend this approach to cover exact and class comprising learning. 
Another interesting direction is to drop the requirement that our corroboration 
functions are recursive, thus obtaining a structure of ‘degrees of unlearnability’ 
analogous to the degrees of unsolvability of classical recursion theory.

Abstract. Flattening is a method to make a definite clause function-
free. For a definite clause C, flattening replaces every occurrence of a term
$f(t_1, \ldots, t_n)$ in C with a new variable v and adds an atom $p_f(t_1, \ldots, t_n, v)$
with the predicate symbol $p_f$ associated with f to the body of C. Here,
we denote the resulting function-free definite clause from C by flat(C).
In this paper, we discuss the relationship between flattening and implica-
tion. For a definite program $\Pi$ and a definite clause D, it is known that
if $flat(\Pi) \models flat(D)$ then $\Pi \models D$, where $flat(\Pi)$ is the set of flat(C)
for each $C \in \Pi$. First, we show that the converse of this statement does
not hold even if $\Pi = \{C\}$, that is, there exist definite clauses C and
D such that $C \models D$ but $flat(C) \not\models flat(D)$. Furthermore, we investi-
gate the conditions on C and D under which $C \models D$ if and only if
$flat(C) \models flat(D)$. Then, we show that, if (1) C is not self-resolving and
D is not tautological, (2) D is not ambivalent, or (3) C is singly recursive,
then the statement holds.



1 Introduction 

The purpose of Inductive Logic Programming is to find a hypothesis that ex- 
plains a given sample. It is a normal setting of Inductive Logic Programming 
that a hypothesis is a definite clause or a definite program and a sample is the 
set of (labeled) ground definite clauses. In this setting, the word “explain” is 
interpreted as either “subsume (denoted by $\succeq$)” or “imply (denoted by $\models$)”.
In the latter case, note that the problem of whether or not a definite clause C 
implies another definite clause, called an implication problem, is undecidable in 
general 0. On the other hand, if C is function- free, then it is obvious that the 
implication problem is decidable. 

Flattening, which was first introduced in the context of Inductive Logic
Programming by Rouveirol [14] (though similar ideas had already been used in
other fields), is a method to make a definite clause function-free. For a defi-
nite clause C, flattening replaces every occurrence of a term $f(t_1, \ldots, t_n)$ in C
with a new variable v and adds an atom $p_f(t_1, \ldots, t_n, v)$ with the predicate
symbol $p_f$ associated with f to the body of C. Additionally, the unit clause
$p_f(x_1, \ldots, x_n, f(x_1, \ldots, x_n)) \leftarrow$ is introduced to the background theory for each
function symbol f in C. We denote the resulting function-free definite clause by
flat(C) and the set of unit clauses by defs(C).
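
The replacement procedure just described can be sketched as follows. This is a minimal illustration rather than the paper's formal definition: atoms and function terms are encoded as tuples `(symbol, arg1, ..., argn)`, variables and constants as strings, and the fresh variables are named `V1, V2, ...` on the assumption that these names do not already occur in the clause. The function name `flatten_clause` is hypothetical.

```python
import itertools

def flatten_clause(head, body):
    """Flatten a definite clause given as (head atom, list of body atoms).

    Atoms and function terms are tuples (symbol, arg1, ..., argn);
    variables and constants are plain strings and are left untouched.
    """
    fresh = (f"V{i}" for i in itertools.count(1))  # assumed not to occur in the clause
    var_for = {}   # flattened function term -> variable that replaces it
    added = []     # new body atoms p_f(t1, ..., tn, v)

    def flat_term(t):
        if isinstance(t, str):
            return t
        f, *args = t
        key = (f, *(flat_term(a) for a in args))   # innermost terms are replaced first
        if key not in var_for:
            var_for[key] = next(fresh)
            added.append((f"p_{f}", *key[1:], var_for[key]))
        return var_for[key]

    def flat_atom(atom):
        return (atom[0], *(flat_term(a) for a in atom[1:]))

    new_head = flat_atom(head)
    new_body = [flat_atom(a) for a in body]
    return new_head, new_body + added

# p(f(X1), f(X2)) <- p(X1, X3), p(X3, X2)
head, body = flatten_clause(("p", ("f", "X1"), ("f", "X2")),
                            [("p", "X1", "X3"), ("p", "X3", "X2")])
```

Here `head` becomes `("p", "V1", "V2")` and the body gains the two atoms `("p_f", "X1", "V1")` and `("p_f", "X2", "V2")`. Every occurrence of the same term is mapped to the same variable, as the definition requires.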

Rouveirol [14] has investigated several properties of flattening. Muggle-
ton mu has dealt with flattening in order to characterize his inverting im- 
plication. De Raedt and Dzeroski [2] have analyzed the PAC-learnability of
jk-clausal theories by transforming possibly infinite Herbrand models into ap-
proximately finite models according to flattening. Recently, Nienhuys-Cheng and
de Wolf [13] have studied the properties of flattening in a more detailed discus-
sion.

Rouveirol [14] (and Nienhuys-Cheng and de Wolf [13]) has shown that flat-
tening “preserves” subsumption: Let C and D be definite clauses. Then, it holds
that:

$C \succeq D$ if and only if $flat(C) \succeq flat(D)$.

Also Rouveirol [14] (and Nienhuys-Cheng and de Wolf [13]) has claimed that
flattening “preserves” implication: Let $\Pi$ be a definite program $\{C_1, \ldots, C_n\}$
and D be a definite clause. We denote $\{flat(C_1), \ldots, flat(C_n)\}$ and $defs(C_1) \cup
\cdots \cup defs(C_n)$ by $flat(\Pi)$ and $defs(\Pi)$, respectively. Then, Rouveirol’s Theorem
is described as follows:

$\Pi \models D$ if and only if $flat(\Pi) \cup defs(\Pi) \models flat(D)$.

As a stronger relationship between flattening and implication than Rou-
veirol’s Theorem, Nienhuys-Cheng and de Wolf [13] have shown the following
theorem:

If $flat(\Pi) \models flat(D)$, then $\Pi \models D$.

If the converse of this theorem holds, then the several learning techniques for 
propositional logic such as 133 are directly applied to Inductive Logic Program- 
ming. On the other hand, if the converse holds, then the implication problem 
$\Pi \models D$ is decidable, because $flat(\Pi)$ and $flat(D)$ are function-free. However, it
contradicts the undecidability of the implication problem |8lit)| or the satisfia- 
bility problem 21 • In this paper, we show that the converse does not hold even 
if $\Pi = \{C\}$, that is, there exist definite clauses C and D such that:

$C \models D$ but $flat(C) \not\models flat(D)$.

Furthermore, we investigate the conditions on C and D under which $C \models D$
if and only if $flat(C) \models flat(D)$. Gottlob [3] has introduced the concepts of self-
resolving and ambivalent clauses. A definite clause C is self-resolving if C resolves
with a copy of C, and ambivalent if there exists an atom in the body of C with
the same predicate symbol as the head of C. As a corollary of Gottlob’s
results [3], we show that, if C is not self-resolving and D is not tautological, or
D is not ambivalent, then the statement holds. Furthermore, note that the C in
the counterexample stated above is a doubly recursive definite clause,
that is, the body of C contains two atoms that are unifiable with the head of a
variant of C. Then, we show that, if C is singly recursive, that is, the body of
C contains at most one atom that is unifiable with the head of a variant of C,
then the statement also holds.

2 Preliminaries 

A literal is an atom or the negation of an atom. A positive literal is an atom and
a negative literal is the negation of an atom. A clause is a finite set of literals.
A unit clause is a clause consisting of exactly one positive literal. A definite
clause is a clause containing exactly one positive literal. A set of definite clauses
is called a definite program. Conventionally, a definite clause is represented as
$A \leftarrow A_1, \ldots, A_m$, where A and $A_i$ ($1 \le i \le m$) are atoms.

Let C be a definite clause $A \leftarrow A_1, \ldots, A_m$. Then, the atom A is called the
head of C and denoted by head(C), and the sequence $A_1, \ldots, A_m$ of atoms is
called the body of C and denoted by body(C).

Let C and D be definite clauses. We say that C subsumes D, denoted by
$C \succeq D$, if there exists a substitution $\theta$ such that $C\theta \subseteq D$, i.e., every literal in $C\theta$
also appears in D. Also we say that C implies D or D is a logical consequence
of C, denoted by $C \models D$, if every model of C is also a model of D. C is logically
equivalent to D, denoted by $C \equiv D$, if $C \models D$ and $D \models C$. For definite programs
$\Pi$ and $\Sigma$, $\Pi \succeq D$, $\Pi \succeq \Sigma$, $\Pi \models D$ and $\Pi \models \Sigma$ are defined similarly.
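
The subsumption test above amounts to a search for a substitution, which can be sketched as a small backtracking matcher. The encoding is an assumption made for illustration: atoms and function terms are tuples `(symbol, arg1, ..., argn)`, strings starting with an upper-case letter are variables, and all other strings are constants. The names `match` and `subsumes` are ours.

```python
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def match(pattern, target, theta):
    """Extend theta so that pattern instantiated by theta equals target, else None."""
    if is_var(pattern):
        if pattern in theta:
            return theta if theta[pattern] == target else None
        return {**theta, pattern: target}
    if isinstance(pattern, str) or isinstance(target, str):
        return theta if pattern == target else None
    if pattern[0] != target[0] or len(pattern) != len(target):
        return None
    for p, t in zip(pattern[1:], target[1:]):
        theta = match(p, t, theta)
        if theta is None:
            return None
    return theta

def subsumes(c, d):
    """True if some substitution maps the head of c to the head of d
    and every body atom of c to some body atom of d."""
    (c_head, c_body), (d_head, d_body) = c, d
    theta0 = match(c_head, d_head, {})
    if theta0 is None:
        return False

    def search(i, theta):
        if i == len(c_body):
            return True
        for atom in d_body:
            extended = match(c_body[i], atom, theta)
            if extended is not None and search(i + 1, extended):
                return True
        return False

    return search(0, theta0)
```

Only the variables of the first clause are instantiated; the second clause is treated as fixed, which is exactly the one-way matching that the definition of $C\theta \subseteq D$ calls for.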

Let C and D be two clauses $\{L_1, \ldots, L_i, \ldots, L_l\}$ and $\{M_1, \ldots, M_j, \ldots, M_m\}$
which have no variables in common. If the substitution $\theta$ is an mgu for the set
$\{L_i, \neg M_j\}$, then the clause $((C - \{L_i\}) \cup (D - \{M_j\}))\theta$ is called a (binary)
resolvent of C and D. The set of all resolvents of C and D is denoted by Res(C, D).

Let $\Pi$ be a definite program and C be a definite clause. An SLD-derivation
of C from $\Pi$ is a sequence $(R_1, C_0, \theta_1), \ldots, (R_k, C_{k-1}, \theta_k)$ such that $R_0 \in \Pi$,
$R_k = C$, $C_{i-1}$ is a variant of an element of $\Pi$, $R_i \in Res(R_{i-1}, C_{i-1})$, and $\theta_i$
is an mgu of the selected literals of $R_{i-1}$ and $C_{i-1}$ for each $1 \le i \le k$. If an
SLD-derivation of C from $\Pi$ exists, we write $\Pi \vdash C$. In particular, $\{C\} \vdash D$ is
denoted by $C \vdash D$.

Theorem 1 (Subsumption Theorem [13]). Let $\Pi$ be a definite program
and D be a definite clause. Then, $\Pi \models D$ if and only if there exists a definite
clause E such that $\Pi \vdash E$ and $E \succeq D$.

For a definite clause C, the lth self-resolving closure of C, denoted by $S^l(C)$,
is defined inductively as follows:

1. $S^0(C) = \{C\}$,

2. $S^l(C) = S^{l-1}(C) \cup \{R \in Res(C, E) \mid E \in S^{l-1}(C)\}$ ($l \ge 1$).

Here, the logically equivalent clauses are regarded as identical. Note that $C \vdash D$
if and only if $D \in S^l(C)$ for some $l \ge 0$. Then:

Corollary 2 (Implication between Definite Clauses [12]). Let C and
D be definite clauses. Then, $C \models D$ if and only if there exists a definite clause
E such that $E \in S^l(C)$ and $E \succeq D$ for some $l \ge 0$.

For each n-ary function symbol f, the associated (n+1)-ary predicate symbol
$p_f$, called a flattened predicate symbol (on f), is introduced uniquely in the
process of flattening. Also we call a definite clause C or a definite program $\Pi$
regular if C or $\Pi$ contains no flattened predicate symbols.

Let C be a definite clause, t be a term appearing in C and v be a variable
not appearing in C. Then, $C|_t^v$ denotes the definite clause obtained from C by
replacing all occurrences of t in C with v.

There exist several (equivalent) variants of the definition of flattening, which differ in two choices:

1. Do we introduce an equality theory or not 

2. Do we transform a constant symbol to an atom with an unary predicate 
symbol Em or not ESI? 

As the definition of flattening, we adopt one similar to that of De Raedt and
Dzeroski [2], which does not introduce an equality theory and does not transform
constant symbols.

Let C be a definite clause. Then, the flattened clause flat(C) of C is defined
as follows:

$$flat(C) = \begin{cases}
C & \text{if } C \text{ is function-free}, \\
flat(C') & \text{if } t = f(t_1, \ldots, t_n)\ (n \ge 1) \text{ appears in } C,
\end{cases}$$

where $C' = C|_t^v \cup \{\neg p_f(t_1, \ldots, t_n, v)\}$ and each $t_i$ ($1 \le i \le n$) is a vari-
able or a constant. Also defs(C) is the set $\{p_f(x_1, \ldots, x_n, f(x_1, \ldots, x_n)) \leftarrow\ \mid
f(t_1, \ldots, t_n) \text{ appears in } C\}$ of unit clauses. Furthermore, the number of calls of
flat that is necessary to obtain the function-free clause flat(C) of C is called the
rank of C and denoted by rank(C).

For a definite program $\Pi = \{C_1, \ldots, C_n\}$, we define $flat(\Pi)$ and $defs(\Pi)$ as
follows:

$$flat(\Pi) = \{flat(C_1), \ldots, flat(C_n)\},$$
$$defs(\Pi) = defs(C_1) \cup \cdots \cup defs(C_n).$$

3 Flattening and Implication

As the relationship between flattening and subsumption, Rouveirol [14] (and
Nienhuys-Cheng and de Wolf [13]) has shown the following theorem:

Theorem 3 (Rouveirol [14], Nienhuys-Cheng & de Wolf [13]). Let C
and D be regular definite clauses. Then, $C \succeq D$ if and only if $flat(C) \succeq flat(D)$.

Also Rouveirol [14] (and Nienhuys-Cheng and de Wolf [13]) has proposed the
following relationship between flattening and implication, known as Rouveirol’s
Theorem:

Theorem 4 (Rouveirol [14], Nienhuys-Cheng & de Wolf [13]). Let $\Pi$
be a regular definite program and D be a regular definite clause. Then, $\Pi \models D$
if and only if $flat(\Pi) \cup defs(\Pi) \models flat(D)$.

In the Appendix, we discuss the proof of Rouveirol’s Theorem.

Furthermore, Nienhuys-Cheng and de Wolf [13] have shown the following
theorem, which is a stronger relationship between flattening and implication
than Rouveirol’s Theorem:

Theorem 5 (Nienhuys-Cheng & de Wolf [13]). Let $\Pi$ be a regular def-
inite program and D be a regular definite clause. If $flat(\Pi) \models flat(D)$, then
$\Pi \models D$.

On the other hand, the converse of Theorem 5 does not hold even if $\Pi = \{C\}$:

Theorem 6. There exist regular definite clauses C and D such that
$C \models D$ but $flat(C) \not\models flat(D)$.

Proof. Let C and D be the following regular definite clauses:

$C = p(f(x_1), f(x_2)) \leftarrow p(x_1,x_3), p(x_3,x_2)$,

$D = p(f(f(x_1)), f(f(x_2))) \leftarrow p(x_1,x_3), p(x_3,x_4), p(x_4,x_5), p(x_5,x_2)$.

By resolving C with a copy of C itself twice, it holds that $C \vdash D$ as in Figure 1.
Hence, it holds that $C \models D$.

$p(f(f(y_1)), f(x_2)) \leftarrow p(f(y_2),x_2), p(y_1,y_3), p(y_3,y_2)$

resolved with $C = p(f(z_1), f(z_2)) \leftarrow p(z_1,z_3), p(z_3,z_2)$ gives

$p(f(f(y_1)), f(f(z_2))) \leftarrow p(y_1,y_3), p(y_3,z_1), p(z_1,z_3), p(z_3,z_2)$
$= p(f(f(x_1)), f(f(x_2))) \leftarrow p(x_1,x_3), p(x_3,x_4), p(x_4,x_5), p(x_5,x_2)$

Fig. 1. The SLD-derivation of D from C
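
The two resolution steps of this derivation can be replayed mechanically. The sketch below assumes the same tuple encoding as before (tuples for atoms and function terms, upper-case strings for variables); `unify`, `rename` and `resolve` are hypothetical helper names, and copies of C are made variable-disjoint by appending a suffix to every variable name.

```python
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def walk(t, s):
    while is_var(t) and t in s:
        t = s[t]
    return t

def occurs(v, t, s):
    t = walk(t, s)
    if t == v:
        return True
    return not isinstance(t, str) and any(occurs(v, a, s) for a in t[1:])

def unify(a, b, s):
    """Return a most general unifier extending s, or None."""
    a, b = walk(a, s), walk(b, s)
    if a == b:
        return s
    if is_var(a):
        return None if occurs(a, b, s) else {**s, a: b}
    if is_var(b):
        return None if occurs(b, a, s) else {**s, b: a}
    if isinstance(a, str) or isinstance(b, str):
        return None
    if a[0] != b[0] or len(a) != len(b):
        return None
    for x, y in zip(a[1:], b[1:]):
        s = unify(x, y, s)
        if s is None:
            return None
    return s

def substitute(t, s):
    t = walk(t, s)
    if isinstance(t, str):
        return t
    return (t[0], *(substitute(a, s) for a in t[1:]))

def rename(clause, suffix):
    """Variant of the clause with `suffix` appended to every variable name."""
    def r(t):
        if isinstance(t, str):
            return t + suffix if is_var(t) else t
        return (t[0], *(r(a) for a in t[1:]))
    head, body = clause
    return r(head), [r(a) for a in body]

def resolve(c1, i, c2):
    """Resolve the i-th body atom of c1 against the head of c2 (variable-disjoint)."""
    (h1, b1), (h2, b2) = c1, c2
    s = unify(b1[i], h2, {})
    if s is None:
        return None
    body = b1[:i] + b2 + b1[i + 1:]
    return substitute(h1, s), [substitute(a, s) for a in body]

C = (("p", ("f", "X1"), ("f", "X2")), [("p", "X1", "X3"), ("p", "X3", "X2")])
step1 = resolve(C, 0, rename(C, "a"))       # resolve p(X1, X3) with a copy of C
step2 = resolve(step1, 2, rename(C, "b"))   # resolve the remaining p(f(..), X2)
```

After the second step the head carries two nested applications of f in each argument and the body is a chain of four p atoms, i.e. a variant of D.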



On the other hand, flat(C) and flat(D) are constructed as follows:

$flat(C) = p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_4,x_5), p_f(x_3,x_1), p_f(x_5,x_2)$,

$flat(D) = p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_4,x_5), p(x_5,x_6), p(x_6,x_7), p_f(x_3,x_8), p_f(x_8,x_1), p_f(x_7,x_9), p_f(x_9,x_2)$.

The first and second self-resolving closures of flat(C) are constructed as in Figure 2.
Then, there exists no definite clause $E \in S^2(flat(C))$ such that $E \succeq flat(D)$.
By paying attention to the number of atoms with the predicate $p_f$ and their
relation, it holds that $flat(C) \models flat(D)$ if and only if there exists a definite clause
$E \in S^2(flat(C))$ such that $E \succeq flat(D)$ by Corollary 2. Hence, we can conclude
that there exists no definite clause $E \in S^l(flat(C))$ such that $E \succeq flat(D)$, so it
holds that $flat(C) \not\models flat(D)$. □

$$S^1(flat(C)) = \{flat(C)\} \cup \left\{ \begin{array}{l}
p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_5,x_6), p(x_6,x_7), p_f(x_5,x_8), p_f(x_8,x_1), p_f(x_4,x_2), p_f(x_7,x_3) \\
p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_5,x_6), p(x_6,x_7), p_f(x_3,x_1), p_f(x_7,x_8), p_f(x_8,x_2), p_f(x_5,x_4)
\end{array} \right\}$$

$$S^2(flat(C)) = S^1(flat(C)) \cup \left\{ \begin{array}{l}
p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_4,x_5), p(x_6,x_7), p(x_7,x_8), p_f(x_3,x_9), p_f(x_9,x_1), p_f(x_8,x_{10}), p_f(x_{10},x_2), p_f(x_5,x_{11}), p_f(x_6,x_{11}) \\
p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_5,x_6), p(x_7,x_8), p(x_8,x_9), p_f(x_7,x_{10}), p_f(x_{10},x_{11}), p_f(x_{11},x_1), p_f(x_4,x_2), p_f(x_6,x_3), p_f(x_9,x_5) \\
p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_5,x_6), p(x_7,x_8), p(x_8,x_9), p_f(x_5,x_{10}), p_f(x_{10},x_1), p_f(x_4,x_2), p_f(x_7,x_6), p_f(x_9,x_{11}), p_f(x_{11},x_3) \\
p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_5,x_6), p(x_7,x_8), p(x_8,x_9), p_f(x_3,x_1), p_f(x_6,x_{10}), p_f(x_{10},x_2), p_f(x_7,x_{11}), p_f(x_{11},x_4), p_f(x_9,x_5) \\
p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_5,x_6), p(x_7,x_8), p(x_8,x_9), p_f(x_3,x_1), p_f(x_9,x_{10}), p_f(x_{10},x_{11}), p_f(x_{11},x_2), p_f(x_5,x_4), p_f(x_7,x_6)
\end{array} \right\}$$

Fig. 2. The first and second self-resolving closures of flat(C)

For the definite clauses C and D given in Theorem 6, it holds that $\{flat(C)\} \cup
defs(C) \vdash flat(D)$ as in Figure 3, so it holds that $\{flat(C)\} \cup defs(C) \models flat(D)$.



4 Improvement 

In this section, we investigate the conditions on definite clauses C and D under
which $C \models D$ if and only if $flat(C) \models flat(D)$.

First, we give the following lemma by Gottlob [3]. A definite clause C is
self-resolving if C resolves with a copy of C. A definite clause C is ambivalent
if there exists an atom in body(C) with the same predicate symbol as
head(C). Then:

Lemma 7 (Gottlob [3]). Let C and D be definite clauses.

1. Suppose that C is not self-resolving and D is not tautological. Then, $C \models D$
if and only if $C \succeq D$.

2. Suppose that D is not ambivalent. Then, $C \models D$ if and only if $C \succeq D$.

$flat(C) = p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_4,x_5), p_f(x_3,x_1), p_f(x_5,x_2)$

resolved with $p(y_1,y_2) \leftarrow p(y_3,y_4), p(y_4,y_5), p_f(y_3,y_1), p_f(y_5,y_2)$ gives

$p(x_1,x_2) \leftarrow p(y_2,x_5), p(y_3,y_4), p(y_4,y_5), p_f(y_1,x_1), p_f(x_5,x_2), p_f(y_3,y_1), p_f(y_5,y_2)$

resolved with $p(z_1,z_2) \leftarrow p(z_3,z_4), p(z_4,z_5), p_f(z_3,z_1), p_f(z_5,z_2)$ gives

$p(x_1,x_2) \leftarrow p(y_3,y_4), p(y_4,y_5), p(z_3,z_4), p(z_4,z_5), p_f(y_1,x_1), p_f(z_2,x_2), p_f(y_3,y_1), p_f(y_5,z_1), p_f(z_3,z_1), p_f(z_5,z_2)$

resolved with $p_f(x, f(x)) \leftarrow$ gives

$p(x_1,x_2) \leftarrow p(y_3,y_4), p(y_4,y_5), p(z_3,z_4), p(z_4,z_5), p_f(y_1,x_1), p_f(z_2,x_2), p_f(y_3,y_1), p_f(y_5,f(z_3)), p_f(z_5,z_2)$

resolved with $p_f(x, f(x)) \leftarrow$ gives

$p(x_1,x_2) \leftarrow p(y_3,y_4), p(y_4,y_5), p(y_5,z_4), p(z_4,z_5), p_f(y_1,x_1), p_f(z_2,x_2), p_f(y_3,y_1), p_f(z_5,z_2)$
$= p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_4,x_5), p(x_5,x_6), p(x_6,x_7), p_f(x_8,x_1), p_f(x_9,x_2), p_f(x_3,x_8), p_f(x_7,x_9)$
$= flat(D)$

Fig. 3. The SLD-derivation of flat(D) from $\{flat(C)\} \cup defs(C)$ in Theorem 6

By incorporating Lemma 7 with the previous theorems, we obtain the following
corollary:

Corollary 8. Let C and D be regular definite clauses.

1. Suppose that C is not self-resolving and D is not tautological. Then, $C \models D$
if and only if $flat(C) \models flat(D)$.

2. Suppose that D is not ambivalent. Then, $C \models D$ if and only if $flat(C) \models
flat(D)$.

Proof. 1. By Lemma 7, $C \models D$ if and only if $C \succeq D$. By Theorem 3, $C \succeq D$ if
and only if $flat(C) \succeq flat(D)$. By the definition of $\succeq$ and $\models$, if $flat(C) \succeq flat(D)$
then $flat(C) \models flat(D)$. So it holds that if $C \models D$ then $flat(C) \models flat(D)$.
Hence, the statement holds by Theorem 5.

2. By the definition of ambivalence, the predicate symbol of the head of D
is different from all of the predicate symbols appearing in the body of D. This
condition is preserved in flat(D), because the flattened predicate symbols, which
do not appear in D, are introduced only in the body of D. Hence, flat(D) is
not ambivalent. By Lemma 7 and Theorem 3 the statement is obvious. □

In Theorem 6, C is given as a doubly recursive definite clause, that is, body(C)
contains two atoms that are unifiable with head(C'), where C' is a variant of C.
In the remainder of this section, we restrict the form of C to singly recursive.
Here, a definite clause C is singly recursive if body(C) contains at most one atom
that is unifiable with head(C'), where C' is a variant of C.

Lemma 9 (Gottlob [3]). If $C \models D$, then $head(C) \succeq head(D)$ and $body(C) \succeq
body(D)$.

Let C be a singly recursive definite clause. It is obvious that $|S^l(C)| \le l + 1$
and $|S^{l+1}(C) - S^l(C)| \le 1$ for each $l \ge 0$. Then, the lth self-resolvent $C_l$ of C is
defined inductively as follows:

1. $C_0 = C$,

2. $C_{l+1} = \begin{cases} D & \text{if } S^{l+1}(C) - S^l(C) = \{D\}, \\ C_l & \text{otherwise.} \end{cases}$

Lemma 10. Let C be a singly recursive regular definite clause with function
symbols. Suppose that C contains a term $t = f(t_1, \ldots, t_n)$, where each $t_i$ is either
a variable or a constant. Also let C' be the definite clause $C|_t^v \cup \{\neg p_f(t_1, \ldots, t_n, v)\}$.
Then, it holds that $flat(C_l) = flat(C'_l)$ for each $l \ge 0$.

Proof. We show the statement by induction on l. If l = 0, then the statement is
obvious, since $C_0 = C$, $C'_0 = C'$ and $flat(C) = flat(C')$.

Suppose that the statement holds for $l \le k$. It is sufficient to show the
case that C is of the form $p(t) \leftarrow p(s)$. Consider $C_{k+1}$ and $C'_{k+1}$. By the def-
inition of the (k+1)th self-resolvent, $C_{k+1}$ is a resolvent of C and $C_k$, and
$C'_{k+1}$ is a resolvent of C' and $C'_k$. Then, we can suppose that $C_{k+1}$ is of the
form $(head(C) \leftarrow body(C_k)\mu)\theta$, where $\theta$ is an mgu of $head(C_k)\mu$ and body(C)
and $\mu$ is a renaming substitution. Hence, $C'_{k+1}$ is of the form $(head(C') \leftarrow
body(C'_k)\mu', p_f(t_1, \ldots, t_n, v))\theta'$, where $\theta'$ is a substitution obtained from $\theta$ by re-
placing the binding $t\mu/x$ in $\theta$ with $v/x$, and $\mu'$ is a renaming substitution obtained by
adding the binding $u/v$ (u is a new variable) to $\mu$. By induction hypothesis, it
holds that $flat(C_k) = flat(C'_k)$ and $flat(C) = flat(C')$. By the forms of $C_{k+1}$ and
$C'_{k+1}$, it holds that $flat(C_{k+1}) = flat(C'_{k+1})$. □

Lemma 11. For a singly recursive definite clause C, it holds that $flat(C_l) =
(flat(C))_l$ for each $l \ge 0$.

Proof. We show the statement by induction on rank(C). If rank(C) = 0, then
the statement is obvious, because $flat(C_l) = C_l$ and $flat(C) = C$ for each $l \ge 0$.

Suppose that the statement holds for C such that $rank(C) \le k$. Let C be
a singly recursive definite clause such that $rank(C) = k + 1$. Since C contains
some function symbols, suppose that C contains the term $t = f(t_1, \ldots, t_n)$,
where each $t_i$ is either a variable or a constant. Let C' be the definite clause
$C|_t^v \cup \{\neg p_f(t_1, \ldots, t_n, v)\}$. Then, it holds that $flat(C') = flat(C)$ and $rank(C') =
k$. By Lemma 10 it holds that $flat(C'_l) = flat(C_l)$ for each $l \ge 0$. By induction
hypothesis, it holds that $flat(C_l) = flat(C'_l) = (flat(C'))_l = (flat(C))_l$ for each
$l \ge 0$. Hence, the statement holds for $rank(C) = k + 1$. □

Lemma 12. For a singly recursive definite clause C, it holds that $flat(C) \models
flat(C_l)$ for each $l \ge 0$.

Proof. We show the statement by induction on l. If l = 0, then $C_0 = C$, so the
statement is obvious.

Suppose that the statement holds for $l \le k$. Since $C_{k+1}$ is a resolvent of C
and $C_k$, by Lemma 11 $flat(C_{k+1})$ is a resolvent of flat(C) and $flat(C_k)$. By
the soundness of SLD-resolution (c/. |YI1,1) L it holds that {flat(C),flat{Ck)} |= 
$flat(C_{k+1})$. By induction hypothesis, it holds that $flat(C) \models flat(C_k)$. Hence, it
holds that $flat(C) \models flat(C_{k+1})$, so the statement holds for $l = k + 1$. □



Theorem 13. Let C be a singly recursive regular definite clause and D be a
regular definite clause. Then, $C \models D$ if and only if $flat(C) \models flat(D)$.

Proof. By Theorem 5 it is sufficient to show the only-if direction. We show it
by induction on rank(D). If rank(D) = 0, that is, D is function-free, then so is C
by Lemma 9. Then, flat(C) = C and flat(D) = D, so the statement is obvious.

Suppose that the statement holds for D such that $rank(D) \le k$. Let D be
a regular definite clause such that $rank(D) = k + 1$. Since D contains some
function symbols, suppose that D contains a term $t = f(t_1, \ldots, t_n)$, where each
$t_i$ ($1 \le i \le n$) is a variable or a constant. Also let D' be the definite clause
$D|_t^v \cup \{\neg p_f(t_1, \ldots, t_n, v)\}$. Then, $rank(D') = k$ and $flat(D') = flat(D)$. Suppose
that $C \models D$. Then, by Corollary 2 and the definition of the lth self-resolvent,
there exists an index $l \ge 0$ such that $C_l \succeq D$.

Similarly to the proof of Lemma 19.6 in [13], we can construct the definite
clause C' from $C_l$ such that $C' \succeq D'$ and $flat(C_l) = flat(C')$ as follows: Suppose
that $C_l\theta \subseteq D$. Let $\{s_1, \ldots, s_m\}$ be the set of distinct terms occurring in $C_l$ such
that $s_i\theta = t$. If $s_i$ is a variable, then replace the binding $t/s_i$ with $v/s_i$. If $s_i$ is
of the form $f(r_1, \ldots, r_n)$, in which case the $r_j$ are variables or constants, then
replace all occurrences of $s_i$ in $C_l$ with a new variable $v_i$, add $\neg p_f(r_1, \ldots, r_n, v_i)$
to $C_l$, and add the binding $v/v_i$ to $\theta$. We call the definite clause resulting from
these m adjustments C'. Finally, replace all occurrences of t in bindings in $\theta$
with v, and call the resulting substitution $\theta'$. Then, it holds that $C'\theta' \subseteq D'$, so
$C' \succeq D'$. Hence, $C' \models D'$. Furthermore, by the construction of C', it holds that
$flat(C_l) = flat(C')$.

By induction hypothesis, it holds that $flat(C') \models flat(D')$, so $flat(C_l) \models
flat(D)$. By Lemma 12 it holds that $flat(C) \models flat(D)$. Hence, the statement
holds for $rank(D) = k + 1$. □

5 Conclusion

In this paper, we have investigated the relationship between flattening and im-
plication [13,14]. Let $\Pi$ be a regular definite program and C and D be definite
clauses. As a stronger relationship between flattening and implication than
Rouveirol’s Theorem [14], Nienhuys-Cheng and de Wolf [13] have shown the
following theorem:

If $flat(\Pi) \models flat(D)$, then $\Pi \models D$.

In this paper, we have shown that there exist definite clauses C and D such that:

$C \models D$ but $flat(C) \not\models flat(D)$.

Furthermore, we have shown that if C and D satisfy one of the following condi-
tions, then it holds that $C \models D$ if and only if $flat(C) \models flat(D)$:

1. C is not self-resolving and D is not tautological,

2. D is not ambivalent,

3. C is singly recursive.
The class of definite clauses that flattening preserves implication is corre- 
sponding to the class that the implication problem is decidable EHZl, and the 
class of definite clauses that flattening does not preserve implication in the above 
sense is corresponding to the class that the implication problem is undecid- 
able nisi It is a future work to investigate the relationship between the classes 
of definite clauses that flattening preserves implication and that implication is 
decidable.

Appendix: Rouveirol’s Theorem

For Rouveirol’s Theorem, it is clear that Rouveirol’s original proof [14] is in-
sufficient. On the other hand, Nienhuys-Cheng and de Wolf [13] have shown
Rouveirol’s Theorem as a consequence of Theorem 5 and the following lemma:

Lemma 14. Let $\Pi$ be a regular definite program. Then, $flat(\Pi) \cup defs(\Pi) \models \Pi$.

However, we obtain the following theorem:

Theorem 15. There exist regular definite clauses C and D such that
$C \vdash D$ but $\{flat(C)\} \cup defs(C) \not\vdash flat(D)$.

Proof. Let C and D be the following regular definite clauses:

$C = p(f(x_1,x_3), f(x_3,x_2)) \leftarrow p(x_1,x_3), p(x_3,x_2)$,

$D = p(f(f(x_1,x_2), f(x_2,x_3)), f(f(x_2,x_3), f(x_3,x_4))) \leftarrow p(x_1,x_2), p(x_2,x_3), p(x_2,x_3), p(x_3,x_4)$.

By resolving C with C itself twice, we can show that $C \vdash D$.

On the other hand, flat(C) and flat(D) are constructed as follows:

$flat(C) = p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_4,x_5), p_f(x_3,x_4,x_1), p_f(x_4,x_5,x_2)$,

$flat(D) = p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_4,x_5), p(x_4,x_5), p(x_5,x_6), p_f(x_3,x_4,x_7), p_f(x_4,x_5,x_8), p_f(x_5,x_6,x_9), p_f(x_7,x_8,x_1), p_f(x_8,x_9,x_2)$.

Also $defs(C) = \{p_f(x,y,f(x,y)) \leftarrow\}$.
The first and second self-resolving closures of flat(C) are constructed as in Fig-
ure 4. Note that $\{flat(C)\} \cup defs(C) \vdash flat(D)$ if and only if there exists a
definite clause $E \in S^2(flat(C))$ such that flat(D) is obtained by resolving E
with $p_f(x,y,f(x,y)) \leftarrow$ some number of times. Then, we cannot obtain flat(D) from
any element of $S^2(flat(C))$ except $E_1$ and $E_2$. Furthermore, the resolvent of
$E_i$ ($i = 1,2$) with $p_f(x,y,f(x,y)) \leftarrow$ twice, where the selected atoms in $E_i$ are
the atoms whose third argument is $x_{10}$, contains a term with f. Hence,
it holds that $\{flat(C)\} \cup defs(C) \not\vdash flat(D)$. □

$$S^1(flat(C)) = \{flat(C)\} \cup \left\{ \begin{array}{l}
p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_5,x_6), p(x_6,x_7), p_f(x_5,x_6,x_8), p_f(x_8,x_3,x_1), p_f(x_3,x_4,x_2), p_f(x_6,x_7,x_3) \\
p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_5,x_6), p(x_6,x_7), p_f(x_3,x_4,x_1), p_f(x_6,x_7,x_8), p_f(x_4,x_8,x_2), p_f(x_5,x_6,x_4)
\end{array} \right\}$$

$$S^2(flat(C)) = S^1(flat(C)) \cup \left\{ \begin{array}{l}
E_1 : p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_4,x_5), p(x_6,x_7), p(x_7,x_8), p_f(x_3,x_4,x_9), p_f(x_9,x_{10},x_1), p_f(x_7,x_8,x_{11}), p_f(x_{10},x_{11},x_2), p_f(x_4,x_5,x_{10}), p_f(x_6,x_7,x_{10}) \\
E_2 : p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_4,x_5), p(x_6,x_7), p(x_7,x_8), p_f(x_3,x_4,x_{10}), p_f(x_9,x_{10},x_1), p_f(x_4,x_5,x_{11}), p_f(x_{10},x_{11},x_2), p_f(x_7,x_8,x_{10}), p_f(x_6,x_7,x_9) \\
p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_5,x_6), p(x_7,x_8), p(x_8,x_9), p_f(x_{10},x_3,x_1), p_f(x_3,x_4,x_2), p_f(x_{11},x_5,x_{10}), p_f(x_5,x_6,x_3), p_f(x_7,x_8,x_{11}), p_f(x_8,x_9,x_5) \\
p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_5,x_6), p(x_7,x_8), p(x_8,x_9), p_f(x_{10},x_3,x_1), p_f(x_3,x_4,x_2), p_f(x_5,x_6,x_{10}), p_f(x_6,x_{11},x_3), p_f(x_7,x_8,x_6), p_f(x_8,x_9,x_{11}) \\
p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_5,x_6), p(x_7,x_8), p(x_8,x_9), p_f(x_3,x_4,x_1), p_f(x_4,x_{10},x_2), p_f(x_{11},x_5,x_4), p_f(x_5,x_6,x_{10}), p_f(x_7,x_8,x_{11}), p_f(x_8,x_9,x_5) \\
p(x_1,x_2) \leftarrow p(x_3,x_4), p(x_5,x_6), p(x_7,x_8), p(x_8,x_9), p_f(x_3,x_4,x_1), p_f(x_4,x_{10},x_2), p_f(x_5,x_6,x_4), p_f(x_6,x_{11},x_{10}), p_f(x_7,x_8,x_6), p_f(x_8,x_9,x_{11})
\end{array} \right\}$$

Fig. 4. The first and second self-resolving closures of flat(C)

Hence, we cannot directly conclude Rouveirol’s Theorem from Theorem 5 and
Lemma 14.

Note that the definite clauses C and D in Theorem 15 are not a counterex-
ample to the if-direction of Rouveirol’s Theorem, because $E_1$ and $E_2$ subsume
flat(D) by the following substitutions $\sigma_1$ and $\sigma_2$:

$\sigma_1 = \{x_4/x_6, x_5/x_7, x_6/x_8, x_7/x_9, x_8/x_{10}, x_9/x_{11}\}$,

$\sigma_2 = \{x_4/x_3, x_5/x_4, x_6/x_5, x_3/x_6, x_4/x_7, x_5/x_8, x_7/x_9, x_8/x_{10}, x_9/x_{11}\}$.

Rouveirol’s Theorem seems to be correct, but it is necessary to improve the 
proof by because of Theorem 0 and d

Abstract. This paper extends the traditional inductive logic program-
ming (ILP) framework to a ψ-term capable ILP framework. Ait-Kaci’s
ψ-terms have interesting and significant properties for markedly widening
applicable areas of ILP. For example, ψ-terms allow partial descriptions
of information, generalization and specialization of sorts (or types) placed
instead of function symbols, and abstract descriptions of data using sorts;
they have comparable representation power to feature structures used
in natural language processing. We have developed an algorithm that
learns logic programs based on ψ-terms, made possible by a bottom-up
approach employing the least general generalization (lgg) extended for ψ-
terms. As an area of application, we have selected information extraction
(IE) tasks in which sort information is crucial in deciding the generality
of IE rules. Experiments were conducted on a set of test examples and
background knowledge consisting of case frames of newspaper articles.
The results showed high precision and recall rates for learned rules for
the IE tasks.



1 Introduction 

In the traditional setting of inductive logic programming (ILP) [14], the input is a
set of examples, which are usually ground instances, and background knowledge,
which is a set of ground instances or logic programs. The output of ILP systems is
a set of logic programs, such as pure Prolog programs. The form (i.e., language)
of the output is called a hypothesis language. The task of ILP systems is to find,
based on the background knowledge, good hypotheses that cover as many positive
examples and as few negative examples (if any) as possible.

Previously, as one direction of extending the representation power of examples
and of the hypothesis language of ILP, RHB+ was presented for learning logic
programs based on τ-terms, which are logic terms whose variables have sorts
(or types). τ-terms, however, are a very restricted form of the ψ-terms
used in LOGIN and LIFE.

For example, in the previously proposed framework, a positive example that
expresses “Jack was injured” was represented as

injured(agent => Jack),



O. Watanabe, T. Yokomori (Eds.): ALT’99, LNAI 1720, pp. 169-^^3 1999. 
using a feature (or attribute) agent. If Jack is defined as a sub-sort of people,
this example could be generalized to

injured(agent => people).

However, the example

injured(agent => passenger(count => 10))

could not be generalized to

injured(agent => people(count => number)).

This is because RHB+ would treat passenger as a function symbol and
would not be able to generalize passenger to the sort people. This restricted
the range of applications, which had to learn from the data which of the original
structures of natural language sentences to preserve.

This paper presents the design and algorithm of our new ILP system, which
is capable of handling ψ-terms. After explaining the attractive points of ψ-terms,
we formally define ψ-terms and explain their properties. Then a previous
type-oriented learner is briefly described and an algorithm for achieving an ILP
system that learns logic programs based on ψ-terms is presented. As an application
to test the feasibility of our system, information extraction (IE) is briefly
introduced. After that, experimental results on IE tasks are shown. A discussion and
conclusions close the paper.



2 Attractive Points of ψ-Terms

By introducing features (or attributes) and sorts, ψ-terms offer
the following advantages.

partial descriptions For example, the term name(first => peter) expresses the
information of a person whose first name is known. This is equivalent to
name(first => peter, last => ⊤). In the process of unification, other
features can possibly be added to the term.

dynamic generalization and specialization Sorts, placed at the positions
of function symbols, can be dynamically generalized and specialized. For
example, person(id => name(first => person)) is a generalized form of
man(id => name(first => Jack)).

abstract representation Abstract representations of examples using sorts can
reduce the amount of data. For example,

familiar(agent => person(residence => France), obj => French)

represents a number of ground instances, such as

familiar(agent => Serge(residence => France), obj => French).

coreference Coreference enables the recursive representation of terms. For
example, X : person(spouse => person(spouse => X)) refers to itself
recursively.

NLP applicability ψ-terms have the same representation power as feature
structures, which are used for formally representing the syntax and
semantics of natural language sentences.

3 ψ-Terms

Given a set of sorts S containing the sorts ⊤ and ⊥, a partial order ≤ on S
such that ⊥ is the least and ⊤ is the greatest element, and a set of features
(i.e., labels or attributes) F, ψ-terms are defined as follows.

3.1 Ordered Sorts

Sorts S have a partial order ≤. τ1 ≤ τ2 means that τ2 is more general than τ1.
A set of sorts S must include the greatest element ⊤ and the least element ⊥.
s ∨ t is defined as the supremum of sorts s and t, and s ∧ t is the infimum of sorts
s and t. Then, (S, ≤, ∨, ∧) forms a lattice.

If the given sort hierarchy is a tree without ⊤ and ⊥, we add r ≤ ⊤ for the root
sort r and ⊥ ≤ l for each leaf sort l. As a special treatment, we distinguish constants
from sorts when we have to distinguish them for ILP purposes, while constants
are usually regarded as sorts. Formally, the set of constants C satisfies C ⊂ S,
and for c ∈ C, ∀t: t < c ⊃ t = ⊥.
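As a minimal sketch of these lattice operations on a tree-shaped sort hierarchy, the following Python fragment walks up parent links to find the supremum. The `PARENT` table, the names `TOP` and `BOTTOM` standing for ⊤ and ⊥, and the function names are illustrative assumptions, not part of the system described in this paper.

```python
# Hypothetical toy hierarchy: child sort -> parent sort ("TOP" stands for the greatest sort).
PARENT = {
    "Jack": "men",
    "men": "people",
    "passenger": "people",
    "people": "TOP",
    "2": "number",
    "10": "number",
    "number": "TOP",
}


def ancestors(sort):
    """Return the chain sort, parent(sort), ... up to the root of the tree."""
    chain = [sort]
    while sort in PARENT:
        sort = PARENT[sort]
        chain.append(sort)
    return chain


def supremum(s, t):
    """Least sort that is more general than both s and t (s v t)."""
    above_s = set(ancestors(s))
    for candidate in ancestors(t):
        if candidate in above_s:
            return candidate
    return "TOP"


def infimum(s, t):
    """Greatest sort below both s and t; in a tree this is BOTTOM unless s and t are comparable."""
    if s in ancestors(t):  # t <= s
        return t
    if t in ancestors(s):  # s <= t
        return s
    return "BOTTOM"
```

In this toy hierarchy, passenger and men meet at people, while two unrelated sorts only meet at the top element.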

3.2 Definition of ψ-Terms

Informally, ψ-terms are Prolog terms whose variables are replaced with a variable
Var of sort s, which is denoted as Var:s. Function symbols are also replaced
with sorts. Terms have features (labels or attributes) for readability and for
representing partial information.

For example,

injured(agent => X : people(of => number))

is an atomic formula based on ψ-terms whose features are agent and of and
whose sorts are people and number.

The recursive definition of ψ-terms is as follows.

Definition 1 (ψ-terms) A ψ-term is either an untagged ψ-term or a tagged
ψ-term.



Footnote: Variables are also used as coreference tags. This is one of the most
elegant ways to represent coreference.

Footnote: This is defined as τ2 ⊑ τ1 in the cited reference.

Definition 2 (tagged ψ-term)

– A variable is a tagged ψ-term.

– If X is a variable and t is an untagged ψ-term, X : t is a tagged ψ-term.

Definition 3 (untagged ψ-term)

– A sort symbol is an untagged ψ-term.

– If s is a sort symbol, l1, ..., ln are features and t1, ..., tn are ψ-terms,
s(l1 => t1, ..., ln => tn) is an untagged ψ-term.

Definition 4 (atomic formula)

– If p is a predicate, l1, ..., ln are features and t1, ..., tn are ψ-terms,
p(l1 => t1, ..., ln => tn) is an atomic formula.

While terms have features, they are compatible with the usual atomic
formulae of Prolog. The first-order term notation p(t1, ..., tn) is syntactic sugar for
the ψ-term notation p(1 => t1, ..., n => tn).

3.3 Least General Generalization

In the definition of the least general generalization (lgg), the part that defines
the term lgg should be extended to the lgg of ψ-terms.

Now, we operationally define the lgg of ψ-terms using the following notations.
a and b represent untagged ψ-terms; s, t, and u represent ψ-terms; f, g, and h
represent sorts; X, Y, and Z represent variables. Given t = f(l1 => t1, ..., ln => tn),
the li projection of t is defined as t.li = ti. For simplicity of the algorithm,
we regard an untagged ψ-term a appearing without a variable as V : a, where
V is a fresh variable. Note that a variable appearing nowhere typed is implicitly
typed by ⊤.

Definition 5 (lgg of ψ-terms)

1. lgg(X : a, X : a) = X : a.

2. lgg(s, t) = u, where s ≠ t and the tuple (s, t, u) is in the history Hist.

3. If s = X : f(l^s_1 => s1, ..., l^s_n => sn), t = Y : g(l^t_1 => t1, ..., l^t_m => tm) and s ≠ t,
then lgg(s, t) = u, where L = {l^s_1, ..., l^s_n} ∩ {l^t_1, ..., l^t_m} and, for features li ∈ L,
u = Z : h(l1 => lgg(s.l1, t.l1), ..., l|L| => lgg(s.l|L|, t.l|L|)) with h = f ∨ g.
(s, t, u) is added to Hist.

Footnote: The lgg of ψ-terms has already been described in the literature. We
present an algorithm as an extension of Plotkin’s lgg. The lgg of feature terms,
which are equivalent to ψ-terms, can also be found in the literature.

Footnote: Here, ∨ means the supremum of two sorts.

For example, the lgg of

injured(agent => passenger(of => 10))

and

injured(agent => men(of => 2))

is

injured(agent => people(of => number)).
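The three rules of Definition 5 can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Term` class, the toy `PARENT` hierarchy, the fresh variable names `Z1, Z2, ...`, and the restriction to acyclic terms are all choices made here for the example, and the history is only consulted for terms that carry a tag.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical toy sort hierarchy: child -> parent, "TOP" is the greatest sort.
PARENT = {"passenger": "people", "men": "people", "people": "TOP",
          "2": "number", "10": "number", "number": "TOP"}


def supremum(s, t):
    """Least common ancestor of two sorts in the toy tree."""
    chain = [s]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    chain.append("TOP")  # guarantees termination for unknown sorts
    while t not in chain:
        t = PARENT.get(t, "TOP")
    return t


@dataclass(frozen=True)
class Term:
    sort: str
    feats: tuple = ()  # ((feature, Term), ...) pairs
    tag: str = ""      # variable name; "" means untagged


def lgg(s, t, hist, fresh):
    """Least general generalization of two acyclic terms (sketch of Definition 5)."""
    if s.tag and s == t:                    # rule 1: identical tagged terms
        return s
    if s.tag and t.tag and (s, t) in hist:  # rule 2: reuse a recorded generalization
        return hist[(s, t)]
    sf, tf = dict(s.feats), dict(t.feats)   # rule 3: generalize sort and shared features
    shared = sorted(set(sf) & set(tf))
    feats = tuple((l, lgg(sf[l], tf[l], hist, fresh)) for l in shared)
    u = Term(supremum(s.sort, t.sort), feats, f"Z{next(fresh)}")
    if s.tag and t.tag:
        hist[(s, t)] = u
    return u


def show(term):
    """Print a term without its variable tags."""
    if not term.feats:
        return term.sort
    inner = ", ".join(f"{l} => {show(v)}" for l, v in term.feats)
    return f"{term.sort}({inner})"
```

Applied to the two agent terms of the example above, `show(lgg(Term("passenger", (("of", Term("10")),)), Term("men", (("of", Term("2")),)), {}, count(1)))` yields the generalized agent term with sort people and feature of bound to number.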



Definition 6 (lgg of atoms based on ψ-terms) Let P and Q be atomic formulae
(atoms). The lgg of atoms is defined as follows.

1. If P = p(l^s_1 => s1, ..., l^s_n => sn) and Q = p(l^t_1 => t1, ..., l^t_m => tm), then
lgg(P, Q) = p(l1 => lgg(s.l1, t.l1), ..., l|L| => lgg(s.l|L|, t.l|L|)), where
L = {l^s_1, ..., l^s_n} ∩ {l^t_1, ..., l^t_m}.

2. If P = p(l^s_1 => s1, ..., l^s_n => sn), Q = q(l^t_1 => t1, ..., l^t_m => tm) and p ≠ q,
lgg(P, Q) is undefined.

Let P and Q be atoms and L1 and L2 be literals. The lgg of literals and
clauses are defined as follows.

Definition 7 (lgg of literals based on ψ-terms)

1. If L1 and L2 are atomic, then lgg(L1, L2) is the lgg of atoms.

2. If L1 = ¬P and L2 = ¬Q, then lgg(L1, L2) = lgg(¬P, ¬Q) = ¬lgg(P, Q).

3. If L1 = ¬P and L2 = Q, or L1 = P and L2 = ¬Q, then lgg(L1, L2) is
undefined.

Definition 8 (lgg of clauses)

Let clauses c1 = {L1, ..., Ln} and c2 = {K1, ..., Km}. Then lgg(c1, c2) =
{Lij = lgg(Li, Kj) | Li ∈ c1, Kj ∈ c2 and lgg(Li, Kj) is defined}.
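Definitions 6 to 8 can be read as a pairwise loop over literals. The sketch below assumes a hypothetical encoding of a literal as a triple `(negated, predicate, args)` with `args` a feature-to-term dictionary, and takes the term-level lgg as a parameter `lgg_term`; none of these names come from the paper.

```python
def lgg_literal(lit1, lit2, lgg_term):
    """lgg of two literals, or None when it is undefined (Definitions 6 and 7)."""
    (neg1, pred1, args1), (neg2, pred2, args2) = lit1, lit2
    if neg1 != neg2 or pred1 != pred2:
        return None  # different sign or different predicate: undefined
    shared = sorted(set(args1) & set(args2))  # only features present in both atoms
    return (neg1, pred1, {f: lgg_term(args1[f], args2[f]) for f in shared})


def lgg_clause(clause1, clause2, lgg_term):
    """Collect every defined pairwise literal lgg (Definition 8)."""
    result = []
    for lit1 in clause1:
        for lit2 in clause2:
            general = lgg_literal(lit1, lit2, lgg_term)
            if general is not None and general not in result:
                result.append(general)
    return result
```

Any term-level generalization can be plugged in for `lgg_term`; a crude stand-in such as `lambda a, b: a if a == b else "TOP"` is enough to exercise the clause-level logic.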



4 Previous Type-Oriented ILP System 

This section briefly summarizes a previous ILP system. RHB+ learns
logic programs based on τ-terms, which are restricted forms of ψ-terms.
Informally, τ-terms are Prolog terms whose variables are replaced with Var:type and
whose function symbols have features. RHB+ employs a combination of bottom-up
and top-down approaches, following results reported in earlier work.

In the definition of the least general generalization (lgg), the definition
of the term lgg was extended to the τ-term lgg. The other definitions of lgg were
equivalent to the originals.

The special feature of RHB+ is the dynamic type restriction by positive
examples during clause construction. The restriction uses positive examples currently
covered in order to determine appropriate types. For each variable Xi appearing
in the clause, RHB+ computes the supremum of all types bound to Xi when
covered positive examples are unified with the current head in turn.

The search heuristic PWI is a weighted informativity employing a Laplace
estimate. Let T = {Head :- Body} ∪ BK. In the definition of PWI, |P| denotes
the number of positive examples covered by T and Q(T) is the empirical content.

Type information was used for computing |Q(T)|. Let Hs be a set
of instances of Head generated by proving Body using backtracking. |τ| was
defined as the number of constants under type τ in the type hierarchy. When τ
is a constant, |τ| is defined as 1.

$$|Q(T)| = \sum_{h \in H_s} \prod_{\tau \in Types(h)} |\tau|,$$

where Types(h) returns the set of types in h.
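The computation of |Q(T)| can be illustrated with a few lines of Python. The `children` table (a type hierarchy whose leaves are taken to be the constants) and both function names are assumptions made for this sketch.

```python
from math import prod


def type_size(tau, children):
    """Number of constants under type tau; a leaf is treated as a constant of size 1."""
    kids = children.get(tau, [])
    if not kids:
        return 1
    return sum(type_size(kid, children) for kid in kids)


def empirical_content(instances, children):
    """Sum, over the head instances, of the product of the sizes of their types."""
    return sum(prod(type_size(tau, children) for tau in types)
               for types in instances)
```

For instance, with two constants under people and three under number, a head instance typed (people, number) contributes 2 × 3 = 6 and one typed (Jack, number) contributes 1 × 3 = 3.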

The stopping condition also utilized |Q(T)| in the computation of the Model
Covering Ratio (MCR).

RHB+ was successfully applied to the IE task of extracting key information
from 100 newspaper articles related to new product releases. This implies the
potential of applying our ψ-term capable ILP system to the IE task.

5 New ILP Capable of ψ-Terms

This section describes a novel relational learner, ψ-RHB, which learns logic
programs based on ψ-terms.

Extending ILP to a ψ-term capable ILP is not straightforward. In top-down
learning, the learner constructs all possible literals to be added to the current
body. When considering only simple sorts which do not have any arguments,
top-down approaches, like Foil, are efficient. However, when it comes to learning
clauses with ψ-terms, it is not realistic to produce all kinds of literals that contain
possible terms.

For example, if we have predicate p, sorts t1 and t2, features l1 and l2, and
variable X, one of the possible literals is p(l1 => t1(l2 => X : t1(l1 => t2))) because
the predicate arity and term depth are unbounded. The maximum depth of
modification to the sort in a possible literal must be more than the maximum depth
of modification to a sort seen in the given training examples. Moreover, at each
level of a modification to a sort, the maximum number of features of the sort
is the number of features in F. Therefore, generating all patterns of literals with
ψ-terms is too time consuming and so not very practical. To cope with this
problem, the learning strategy should be bottom-up.

5.1 Algorithms of ψ-Term Capable ILP

The positive examples are atomic formulae based on ψ-terms. The hypothesis
language is a set of Horn clauses based on ψ-terms. The background knowledge
also consists of atomic formulae. ψ-RHB, a ψ-term capable ILP system, employs
a bottom-up approach, like Golem.

Learning Algorithm

The learning algorithm of our ILP system is based on Golem’s algorithm
extended for ψ-terms.

The steps are as follows.

Algorithm 1 Learning algorithm

1. Given positive examples P and background knowledge BK.

2. Link sorts which have the same names in P and BK.

3. Set the set of hypotheses H = {}.

4. Select K pairs of examples (Ai, Bi) as EP (0 ≤ i < K).

5. Select sets of literals ARi and BRi as selected background knowledge
according to the variable depth D.

6. Compute lggs of clauses Ai :- ARi and Bi :- BRi.

7. Simplify the lggs by evaluating with weighted informativity PWI, which is
the informativity defined in Section 4.

8. Select the best clause C, and add it to H if the score of C is better than the
threshold θ.

9. Remove covered examples from P.

10. If P is empty then return H; otherwise, go to Step 4.
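The control flow of Algorithm 1 can be sketched as a covering loop. In this hedged Python outline every domain-specific operation (`select_bk`, `lgg_clause`, `simplify`, `score`, `covers`) is passed in as a hypothetical callable, the sort linking of Step 2 is omitted, and the early exit when no acceptable clause is found is an assumption added here to guarantee termination.

```python
import random


def learn(positives, background, select_bk, lgg_clause, simplify, score,
          covers, k=4, depth=2, threshold=0.0, seed=0):
    """Bottom-up covering loop in the spirit of Algorithm 1 (illustrative only)."""
    rng = random.Random(seed)
    remaining = list(positives)                      # Step 1
    hypotheses = []                                  # Step 3
    while remaining:                                 # Step 10
        candidates = []
        for _ in range(k):                           # Step 4: K example pairs
            if len(remaining) > 1:
                a, b = rng.sample(remaining, 2)
            else:
                a = b = remaining[0]
            ar = select_bk(a, background, depth)     # Step 5
            br = select_bk(b, background, depth)
            clause = lgg_clause((a, ar), (b, br))    # Step 6
            candidates.append(simplify(clause))      # Step 7
        best = max(candidates, key=score)            # Step 8
        covered = [e for e in remaining if covers(best, e)]
        if score(best) <= threshold or not covered:
            break  # assumption: give up rather than loop forever
        hypotheses.append(best)
        remaining = [e for e in remaining if e not in covered]  # Step 9
    return hypotheses
```

Passing the operations as parameters keeps the loop independent of how terms, clauses, and the PWI score are actually represented.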

In Step 2, we have to link sorts in the examples and the background
knowledge because the OSF theory, which underlies the theory of ψ-terms, is not
formed under the unique name assumption. For example, if we have two terms
f(t) and g(t), t in f(t) and t in g(t) are not identical. The OSF theory requires
that they be represented as f(X : t) and g(X : t) if t is identical in both
terms. Therefore, the same sort symbols in the examples and in the background
knowledge are linked and will be treated as identical symbols in the later steps.

In Step 5, to speed up the learning process, literals related to each pair of
examples are selected. At first, ARi and BRi are empty. Then, (1) select the
background knowledge literals Asel_i so that it has all literals whose sort symbols
are identical to the sorts in Ai or ARi, and select Bsel_i in the same manner using
Bi or BRi. (2) Add literals Asel_i and Bsel_i to sets ARi and BRi, respectively.
Repeat (1) and (2) n times when the predefined variable depth is n. This iteration
creates sets of literals.
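The iterative selection in Step 5 amounts to a bounded expansion over shared sort symbols. The following sketch assumes a hypothetical encoding of a background literal as a pair `(predicate, sorts)` and is not taken from the system itself.

```python
def select_background(example_sorts, background, depth):
    """Collect background literals reachable from an example within `depth` rounds."""
    selected = []
    known = set(example_sorts)  # sorts seen in the example or in literals selected so far
    for _ in range(depth):
        new = [lit for lit in background
               if lit not in selected and known & set(lit[1])]
        if not new:
            break  # nothing else shares a sort symbol
        selected.extend(new)
        for _, sorts in new:
            known.update(sorts)
    return selected
```

With depth 1 only literals sharing a sort with the example are chosen; each further round also admits literals that share a sort with something chosen earlier.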
We use ARi as the selected background knowledge for Ai, and BRi for Bi in
Step 6. What is computed in Step 6 is the following lgg of clauses:

lgg((Ai :- ⋀ARi), (Bi :- ⋀BRi)),

where ⋀S is a conjunction of all of the elements in S. The lggs of clauses have
a variable depth of at most n.

In Step 7, simplification of the lggs is achieved by checking, for each literal in
the body, whether its removal makes the score of the weighted
informativity worse or not. For the purpose of informativity estimation, we use
the concept of ground instances of atomic formulae based on ψ-terms. We call an
atomic formula A a ground instance if all of the sorts appearing in A are constants.
For example,

familiar(agent => Serge(residence => France), obj => French).

Moreover, literals in the body are checked as to whether they satisfy the
input-output mode declarations of the predicates.
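The literal-removal check of Step 7 can be sketched as a single greedy pass. Here `score` is a hypothetical stand-in for the weighted informativity (higher is assumed to be better), and the mode-declaration check is left out.

```python
def simplify_body(head, body, score):
    """Drop each body literal whose removal does not make the score worse."""
    kept = list(body)
    for literal in list(body):
        trial = [l for l in kept if l != literal]
        if score(head, trial) >= score(head, kept):
            kept = trial  # removal is harmless, so keep the shorter body
    return kept
```

The result depends on the order in which literals are tried, which is a property of this greedy sketch rather than a claim about the original system.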

6 NLP Application 

6.1 Information Extraction 

This section presents a brief introduction to our target application: information
extraction (IE). The task of information extraction involves extracting key
information from a text corpus, such as newspaper articles or WWW pages, in
order to fill empty slots of given templates. Information extraction techniques
have been investigated by many researchers and institutions in a series of
Message Understanding Conferences (MUC), which are not only technical meetings
but also IE system contests conducted on common benchmarks.

The input for the information extraction task is a set of natural language 
texts (usually newspaper articles) with an empty template. In most cases, the 
articles describe a certain topic, such as corporate mergers or terrorist attacks 
in South America. The given templates have some slots which have field names, 
e.g., “company name” and “merger date”. The output of the IE task is a set 
of filled templates. IE tasks are highly domain dependent because the rules and 
dictionaries used to fill values in the template slots depend on the domain. 



6.2 Problem in IE System Development 

The domain dependence has been a serious problem for IE system developers.
As an example, Umass/MUC-3 needed about 1500 person-hours of skilled labor
to build the IE rules represented as a dictionary. Worse, new rules have to
be constructed from scratch when the target domain is changed.

Fig. 1. Block diagram of the preparation of data for ILP



To cope with this problem, some researchers have studied methods to learn
information extraction rules. Against this background, we selected the IE task as an
application of a ψ-term capable ILP. An IE task is appropriate for our application
because natural languages contain a vast variety of nouns relating to a taxonomy
(i.e., sort hierarchy).



7 Experimental Results 

For the purpose of estimating the performance of our system, we conducted
experiments on the learning of IE rules. The IE tasks here involved the MUC-4
style IE and the template elements to be filled included two items. We
extracted articles related to accidents from a one-year newspaper corpus written
in Japanese. Forty-two articles were related to accidents which resulted in some
deaths and injuries. The template we used consisted of two slots: the number
of deaths and the number of injuries. We filled one template for each article. After
parsing the sentences, tagged parse trees were converted into atomic formulae
representing case frames.

Figure 1 shows the learning block diagram. Those case frames were given
to our learner as background knowledge. All of the 42 articles could be
represented as case frames thanks to the representation power of ψ-terms,
while only 25 articles could be represented using τ-terms. Each slot of a
filled template was given as a positive example. Precision and recall, the
standard evaluation metrics for IE tasks, were measured using four-fold
cross validation on the 42 examples.
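For readers unfamiliar with the protocol, four-fold cross validation on 42 examples can be sketched as below. How the folds were actually formed is not stated in the text, so the round-robin assignment and the function name are assumptions.

```python
def k_fold_splits(n_examples, k):
    """Return k (train_indices, test_indices) pairs using round-robin fold assignment."""
    folds = [list(range(start, n_examples, k)) for start in range(k)]
    splits = []
    for held_out in folds:
        excluded = set(held_out)
        train = [i for i in range(n_examples) if i not in excluded]
        splits.append((train, held_out))
    return splits
```

Each example is held out exactly once, and rules learned on the remaining three folds are evaluated on the held-out fold.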



Footnote: This is a relatively simple setting compared to state-of-the-art IE tasks.

Footnote: We thank the Mainichi Newspaper Co. for permitting us to use the articles of 1992.
Table 1. Comparison between RHB+ and ψ-RHB

[deaths]     Time (sec)   |Hypo|   |P| / |Q(T)|
RHB+         172.7        4        25/25
ψ-RHB        954.2        4        25/25

[injuries]   Time (sec)   |Hypo|   |P| / |Q(T)|
RHB+         508.5        8        25/25
ψ-RHB        3218.0       10       25/25


Precision = (correct output answers) / (output answers)

Recall = (correct output answers) / (all correct answers)
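The two measures can be computed directly from sets of answers. In this small sketch the representation of an answer as an (article, slot, value) triple and the function name are assumptions made for illustration.

```python
def precision_recall(output_answers, correct_answers):
    """Precision and recall of a set of extracted answers against the gold answers."""
    output_answers, correct_answers = set(output_answers), set(correct_answers)
    correct_output = output_answers & correct_answers
    precision = len(correct_output) / len(output_answers) if output_answers else 0.0
    recall = len(correct_output) / len(correct_answers) if correct_answers else 0.0
    return precision, recall
```

Returning 0.0 when a denominator is empty is a convention chosen here; the text does not say how that case was handled.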



7.1 Natural Language Processing Tools 

We used a sort hierarchy hand-crafted for the Japanese-English machine
translation system ALT-J/E. This hierarchy is a sort of concept thesaurus
represented as a tree structure in which each node is called a category (i.e., a sort).
An edge in the tree represents an is-a relation among the categories. The current
version of the sort hierarchy is 12 levels deep and contains about 3000 category
nodes. We also used the commercial-quality morphological analyzer, parser, and
semantic analyzer of ALT-J/E.

Results after the semantic analysis were parse trees with case and semantic 
tags. We developed a logical form translator FEP2 which generates case frames 
expressed as atomic formulae from the parse trees. 

7.2 Results 

Table 1 and Table 2 show the results of our experiments. Table 1 shows the
experimental results of RHB+ and ψ-RHB using the same 25 examples as in the
previously reported experiments. We used a SparcStation 20 for this experiment.
ψ-RHB showed a high accuracy like RHB+ but slowed down in exchange for its
extended representation power in the hypothesis language.

Table 2 shows the experimental results on forty-two examples. We used an
AlphaStation 500/333MHz for this experiment. Overall, a very high precision,
90-97%, was achieved. 63-80% recall was achieved with all case frames including
errors in the case selection and semantic tag selection. These selections had an
error range of 2-7%. With only correct case frames, 67-88% recall was achieved.

It is important to note that the extraction of two different pieces of informa- 
tion showed good results. This indicates that our learner has high potential in 
IE tasks. 
Table 2. Learning results of accidents

                               deaths    injuries
Precision (all case frames)    96.7%     89.9%
Recall (all case frames)       80.0%     63.2%
Recall (correct case frames)   87.5%     66.7%
Average time (sec.)            966.8     828.0



8 Discussion 

The benefits of ψ-term capability, as opposed to τ-term capability alone,
depend in an application on the writing style of the topic. In English, the expression
“ABC Corp.’s printer” is commonly used and the logical term representation
can be printer(pos => “ABC Corp.”). However, if the expression “ABC Corp.
released a printer and ...” were very common, it could be release(“ABC Corp.”,
printer). In this case, since the required representation is within the τ-term,
extending the language from τ-term based to ψ-term based does not justify the
higher computing cost.

In the IE community, some previous research has looked at generating IE
rules from texts with filled templates, for example, AutoSlog-TS, CRYSTAL,
LIEP, and RAPIER. The main differences between our approach
and the others are that

– we use semantic representations (i.e., case frames) created by a
domain-independent parser and semantic analyzer;

– we use ILP techniques independent of both the parser and semantic analyzer.

Note that the second item means that learned logic programs may have 
several atomic formulae in the bodies. This point differentiates our approach 
from the simple generalization of a single case frame. 

INDIE learns a set of feature terms equivalent to ψ-terms. The learning
is equivalent to the learning of a set of atomic formulae based on ψ-terms which
cover all positive examples and no negative examples. Because its hypotheses
are generated so as to exclude any negatives, it might be intolerant of noise.

Sasaki reported a preliminary version of an ILP system which was capable of
handling limited features of ψ-terms. Preliminary experiments were conducted on
learning IE rules to extract information from only twenty articles.



9 Conclusions and Remarks 

This paper has described an algorithm of a ψ-term capable ILP and its
application to information extraction. The lgg of logic terms was extended to the lgg
of ψ-terms. The learning algorithm is based on the lgg of clauses with ψ-terms.
Natural language processing relies on a vast variety of nouns relating to the sort
hierarchy (or taxonomy) which plays a crucial role in generalizing data generated
from the natural language. Therefore, the information extraction task matches
the requirements of the ψ-term capable ILP.

Because of the modest robustness and performance of current natural
language analysis techniques (for Japanese texts), errors were found in parsing, case
selection, and semantic tag selection. The experimental results, however, show
that learned rules achieve high precision and recall in IE tasks. Moreover, an
important point is that all of the 42 articles related to the topic were able to
be represented as case frames, which demonstrates the representation power of
ψ-terms. This indicates that applying ILP to the learning from case frames will
become more practical as NLP techniques progress in the near future.
Abstract. Some authors have repeatedly pointed out that the use of
the accuracy, in particular for comparing classifiers, is not adequate.
The main argument questions the validity of some assumptions underlying
the use of this criterion. In this paper, we study the hardness of
replacing the accuracy in various ways, using a framework very sensitive
to these assumptions: Inductive Logic Programming. Replacement
is investigated in three ways: completion of the accuracy with an
additional requirement, replacement of the accuracy by the ROC analysis,
recently introduced from signal detection theory, and replacement of the
accuracy by a single criterion. We prove strong hardness results for most
of the possible replacements. The major point is that allowing arbitrary
multiplication of clauses appears to be totally useless. Another point is
the equivalence in difficulty of various criteria. In contrast, the accuracy
criterion appears to be tractable in this framework.



1 Introduction 

As the number of classification learning algorithms is rapidly increasing, the
question of finding efficient criteria to compare their results is of particular
relevance. This is also of importance for the algorithms themselves, as they can
naturally optimize such criteria directly to achieve good results. A criterion
frequently encountered to address both problems is the accuracy, which has
recently received some criticism about its adequacy on these topics [7].

The primary inadequacy of the accuracy stems from a tacit assumption that the
overall accuracy controls by-class accuracies, or similarly that class distributions
among examples are constant and relatively balanced [6]. This is obviously not
true: skewed distributions are frequent in agronomy, or more generally in life
sciences. As an example, consider the human DNA, in which no more than 6%
are coding genes [7]. In such cases, the interesting, unusual class is often the
rare one, and the well-balanced hypothesis may not lead to the discovery of the
unusual individuals. Moreover, in real-world problems, not only is this assumption
false, but the misclassification of some examples may also have heavy consequences,
another cost which is not integrated in the accuracy. Fraud detection is a good
example of such situations [7], but medical domains are typical. As an example,
consider the case where a mutagen molecule is predicted as non-mutagen, and
the case where a harmless molecule is predicted as mutagen. In such cases, the
interesting class has the heaviest misclassification costs, and the equal error costs
assumption may produce bad results. Finally, the accuracy may be inadequate in
some cases because other parameters are to be taken into account. Constraints
on size parameters are sometimes to be used because we want to obtain small
formulae, for interpretation purposes. As an example, consider again the problem
of mutagenesis prediction, where two equally accurate formulae are obtained. If
one is much smaller, it is more likely to provide useful descriptions for the mining
expert.

We have chosen for our framework a field particularly sensitive to these
problems, Inductive Logic Programming (ILP). ILP is a rapidly growing research
field, concerned with the use of variously restricted subclasses of Horn clauses to
build Machine Learning (ML) algorithms. According to [9], almost seventy
applications use the ILP formalism, twenty of which are science applications, which
can be partitioned into biological (four) and drug design (sixteen) applications.
ILP-ML algorithms have been applied with some success in areas of biochemistry
and molecular biology [9]. Using the ILP formalism, we argue that the replacement
of the accuracy raises structural complexity issues. The argument is structured
as follows.

First, to address the latter problems, we explain that the single accuracy
requirement can be completed by an additional requirement to provide more adequate
criteria. We integrate various constraints over two important kinds of
parameters: by-class error functions, and representation parameters such as feature
selection ratios and size constraints. We show that any such integration leads to
a very negative structural complexity result, similar to NP-hardness, which is
not faced by the accuracy optimization alone. The result has a side effect which
can be presented as a “loss” in the formalism’s expressiveness, a rare property
in classical ML complexity issues. Indeed, it authorizes the construction of
arbitrarily large (even exponential-sized) sets of Horn clauses, which we prove
to have no more expressive power than a single Horn clause. We prove a threshold
in intractability, since it appears immediately with the additional requirement
and is not a function of its tightness. Furthermore, the effects of the
constraints on optimal accuracies vanish as the number of predicates increases, as
optimal accuracies with or without the additional constraints are asymptotically
equal. Finally, for some criteria, their mixing with the accuracy brings the most
negative result: not only does the intractability appear immediately with the
criterion, but also the error cannot be dropped below that of the unbiased
coin. We then study the replacement of the accuracy criterion using a general
method [6, 7], derived from statistical decision theory, based on a specific
bi-criteria optimization. We show that this method leads to the same drawbacks.
Finally, we investigate the replacement of the error by a single criterion, and
show that it is also to be analyzed very carefully, as some of the “candidates”
lead exactly to the same negative results presented before. The reductions are
presented for a subclass of Horn formalism simple enough to be an element of
the intersection of all classically encountered theoretical ILP studies.

2 Mono and Bi-criteria Solutions to Replace the 
Accuracy 

Denote as $\mathcal{C}$ and $\mathcal{H}$ two classes of concept representations,
respectively called the target’s class and the hypothesis class. In real-world
domains, we do not know the target concept’s class, which is why we have to make
ad hoc choices for $\mathcal{H}$ with a powerful enough formalism, yet ensuring
tractability. Even if some benchmark problems appear to be easily solvable [3],
ML applications, and particularly ILP, face more difficult problems [9], for which
the choice of $\mathcal{H}$ is crucial. Since most of the studies dealing with the
accuracy replacement problem have been investigated with two classes [7], we also
consider two-class problems and not multi-class cases. It is not really important
for us, as results already become hard in that setting. Let $c \in \mathcal{C}$.
Suppose that we have drawn examples following some unknown but fixed distribution
$D$, labelled according to $c$. We can denote the accuracy of $h \in \mathcal{H}$
with respect to (w.r.t.) $c$ by $P_D(h = c) = \sum_{h(x) = c(x)} D(x)$.
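On a finite domain, this accuracy is simply the probability mass of the points on which the hypothesis and the target agree. The sketch below assumes that the distribution is given as a dictionary from examples to probabilities and that `h` and `c` are plain Python functions; both choices are illustrative.

```python
def accuracy(h, c, dist):
    """Probability mass of the examples x on which hypothesis h agrees with target c."""
    return sum(p for x, p in dist.items() if h(x) == c(x))
```

For example, with masses 0.4, 0.3, 0.2, 0.1 on the points 0 to 3, target `x % 2` and hypothesis `int(x >= 2)`, the two agree on the points 0 and 3, giving an accuracy of 0.5.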



2.1 Extending the Accuracy 

The principal drawbacks of the accuracy are of two kinds: the equal costs as- 
sumption [6], and the well balanced assumption [7]. We propose a solution to 
the problem by the maximization of the accuracy subject to constraints. We 
also propose criteria on related problems, an example of such being the feature 
selection problem, in which we want to build formulae on restricted windows 
of the total features set. For any fixed positive rational v, we use the follow- 
ing adequate notion of distance between two reals u,v : du(u,v) = . We 

also use eight rates on the examples (definitions differ slightly from [7]):
$TP = \sum_{h(x)=1=c(x)} D(x)$; $TPR = TP/P$; $FP = \sum_{h(x)=1 \neq c(x)} D(x)$; $FPR = FP/N$;
$TN = \sum_{h(x)=0=c(x)} D(x)$; $TNR = TN/N$; $FN = \sum_{h(x)=0 \neq c(x)} D(x)$; $FNR = FN/P$,
with $N = \sum_{c(x)=0} D(x)$ and $P = \sum_{c(x)=1} D(x)$. In order to complete
the accuracy requirements, we imagine seven types of additional constraints,
each of them being parameterized by a number $\zeta$ (between 0 and 1). Each of
them defines a subset of $\mathcal{H}$, which shall be parameterized by $D$ if the distribution
controls the subset through the constraint. The first three subsets of $\mathcal{H}$ contain
hypotheses for which the FP and FN are not far from each other, or a one-sided
error is upper bounded: $\mathcal{H}_{D,1}(\zeta) = \{h \in \mathcal{H} \mid d_\nu(FP, FN) \le \zeta\}$;
$\mathcal{H}_{D,2}(\zeta) = \{h \in \mathcal{H} \mid FN \le \zeta\}$; $\mathcal{H}_{D,3}(\zeta) = \{h \in \mathcal{H} \mid FN \le \ldots\}$. The two following subsets
are parameterized by constraints equivalent to some frequently encountered
in the information retrieval community [8], respectively (1 minus) the precision
and (1 minus) the recall criteria: $\mathcal{H}_{D,4}(\zeta) = \{h \in \mathcal{H} \mid FP/(TP + FP) \le \zeta\}$;
$\mathcal{H}_{D,5}(\zeta) = \{h \in \mathcal{H} \mid FN/(TP + FN) \le \zeta\}$. Define $\#P(h)$ as the total number
of different predicates of $h$, $\#W(h)$ as the whole number of predicates of
$h$ (if one predicate is present $k$ times, it is counted $k$ times), and $\#T$ as the
total number of different available predicates. The two last subsets of $\mathcal{H}$ are
parameterized by formulae respectively having a sufficiently small fraction of
the available predicates, or having a sufficiently small overall size: $\mathcal{H}_6(\zeta) = \{h \in \mathcal{H} \mid \#P(h)/\#T \le \zeta\}$;
$\mathcal{H}_7(\zeta) = \{h \in \mathcal{H} \mid \#W(h)/\#T \le \zeta\}$. The division by the
total number of different predicates in $\mathcal{H}_7(\zeta)$ is made only for technical reasons:
to obtain hardness results for small values of $\zeta$ and thus, already for small sizes
of formulae (in the last constraint). The first problem we address can be summarized
as follows:

Problem 1: Given $\zeta$ and $a \in \{1, 2, \ldots, 7\}$, can we find an algorithm
returning a set of Horn clauses from $\mathcal{H}_{(D,)a}(\zeta)$ whose error is no more
than a given $\gamma$, if such a hypothesis exists?
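
As an illustration only, the rates of Section 2.1 and some of the constraints can be evaluated directly on a finite weighted sample. The sketch below assumes that $D$ is given as a list of (instance, label, weight) triples, that the hypothesis is an ordinary Python function returning 0 or 1, and that both classes carry positive weight; the names `rates` and `satisfies` are hypothetical and do not come from the paper. Only constraints 2, 4 and 5 are covered.

```python
from fractions import Fraction


def rates(examples, h):
    """Return the eight rates for hypothesis h on weighted (x, label, weight) triples."""
    tp = fp = tn = fn = Fraction(0)
    for x, label, weight in examples:
        predicted = h(x)
        if predicted == 1 and label == 1:
            tp += weight
        elif predicted == 1 and label == 0:
            fp += weight
        elif predicted == 0 and label == 0:
            tn += weight
        else:
            fn += weight
    p, n = tp + fn, fp + tn  # total weight of positive / negative examples
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn, "P": p, "N": n,
            "TPR": tp / p, "FNR": fn / p, "FPR": fp / n, "TNR": tn / n}


def satisfies(r, a, zeta):
    """Check constraint a (2, 4 or 5 only) with parameter zeta on the rates r."""
    if a == 2:  # one-sided error: FN <= zeta
        return r["FN"] <= zeta
    if a == 4:  # (1 minus) precision; assumption: vacuously true if h predicts no positive
        predicted_pos = r["TP"] + r["FP"]
        return predicted_pos == 0 or r["FP"] / predicted_pos <= zeta
    if a == 5:  # (1 minus) recall: FN / (TP + FN) <= zeta
        return r["FN"] / (r["TP"] + r["FN"]) <= zeta
    raise ValueError("only constraints 2, 4 and 5 are sketched here")


# Toy sample: instances 0 and 1 are positive, 2 and 3 negative, uniform weights.
quarter = Fraction(1, 4)
examples = [(0, 1, quarter), (1, 1, quarter), (2, 0, quarter), (3, 0, quarter)]
h = lambda x: 1 if x <= 2 else 0  # misclassifies instance 2 only
r = rates(examples, h)
accuracy = r["TP"] + r["TN"]
```

On this toy sample the hypothesis makes a single false positive, so it meets the recall-type constraint for any $\zeta$ but meets the precision-type constraint only when $\zeta \ge 1/3$.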



2.2 Replacing the Accuracy: The ROC Analysis 

Receiver Operating Characteristic (ROC) analysis is a traditional methodology 
from signal detection theory [1]. It has been used in machine learning recently 
[6, 7] in order to correct the main drawbacks of the accuracy. In ROC space (this
is the coordinate system), we visualize the performance of a classifier by plotting
TPR on the Y axis, and FPR on the X axis. Figure 1 presents the ROC
analysis, along with three possible outputs which we present and analyze. If a
classifier produces a continuous output (such as an estimate of the posterior probability
of an instance's class membership [7]), for any possible value of FPR,
we can get a value for TPR, by thresholding the output between its extreme
bounds. If a classifier produces a discrete output (such as Horn clauses), then
the classifier gives rise to a single point. If the classifier is the random choice
of the class, either (if it is continuous) the curve is the line y = x, or (if it
is discrete) there is a single dot on the line y = x.

Fig. 1. The ROC analysis of a learning algorithm.

One important thing to
note is that the ROC representation gives the behavior of an algorithm without
regard to the class distribution or the error cost [6]. It also allows us to choose
the best of several classifiers, by the following procedure. Fix as $K^+$ the cost
of misclassifying a positive example, and $K^-$ the cost of misclassifying a negative
example (these two costs depend on the problem). Then the expected
cost of some classifier represented by point (FPR, TPR) is given by the following
formula: $\sum_{c(x)=1} D(x) \times (1 - TPR) \times K^+ + \sum_{c(x)=0} D(x) \times FPR \times K^-$.

Two algorithms, whose corresponding points are respectively $(FPR_1, TPR_1)$ and
$(FPR_2, TPR_2)$, have the same expected cost iff $(TPR_2 - TPR_1)/(FPR_2 - FPR_1) = \left(K^- \sum_{c(x)=0} D(x)\right) / \left(K^+ \sum_{c(x)=1} D(x)\right)$.
This gives the slope of an
isoperformance line, which only depends on the relative weights of the examples,
and the respective misclassification costs. Given one point on the ROC, the
classifiers performing better are those on the “northwest” of the isoperformance
line with the preceding slope, and to which the point belongs. If we want to
find an algorithm A performing surely better than an algorithm B, we therefore
should strive to find A such that its point lies in the rectangle whose opposite
vertices are the (0, 1) point (the perfect classification) and B's point (a grey
rectangle is shown on the top left of figure 1). From that, the second problem
we address is the following (note the constraint's weakness: the algorithm is
required to work only on a single point):

Problem 2: Given one point $(FPR_x, TPR_x)$ on the ROC, can we find
an algorithm returning a set of Horn clauses whose point falls into
the rectangle with opposite vertices (0, 1) and $(FPR_x, TPR_x)$, if such
a hypothesis exists?
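
The cost comparison described in Section 2.2 can be sketched numerically. The code below is a minimal illustration, assuming the class weights $P$ and $N$ and the costs $K^+$, $K^-$ are given as exact fractions; the function names are hypothetical. The slope is obtained by equating the expected costs of two points, and `dominates` tests membership in the rectangle with opposite vertices (0, 1) and a reference point.

```python
from fractions import Fraction


def expected_cost(fpr, tpr, p, n, k_pos, k_neg):
    """Expected cost of a classifier located at (fpr, tpr) in ROC space."""
    return p * (1 - tpr) * k_pos + n * fpr * k_neg


def isoperformance_slope(p, n, k_pos, k_neg):
    """Slope of the lines of equal expected cost in ROC space."""
    return (n * k_neg) / (p * k_pos)


def dominates(a, b):
    """True if point a = (FPR, TPR) lies in the rectangle spanned by (0, 1) and b."""
    return a[0] <= b[0] and a[1] >= b[1]


# Hypothetical setting: balanced classes, a missed positive costs twice a false alarm.
p = n = Fraction(1, 2)
k_pos, k_neg = Fraction(2), Fraction(1)

b = (Fraction(1, 5), Fraction(3, 5))
c = (Fraction(3, 5), Fraction(4, 5))   # on the same isoperformance line as b
a = (Fraction(1, 10), Fraction(7, 10))  # inside the rectangle defined by b

slope = isoperformance_slope(p, n, k_pos, k_neg)
cost_a = expected_cost(*a, p, n, k_pos, k_neg)
cost_b = expected_cost(*b, p, n, k_pos, k_neg)
cost_c = expected_cost(*c, p, n, k_pos, k_neg)
```

Points `b` and `c` are joined by a segment whose slope equals the isoperformance slope, so their costs coincide, while `a` dominates `b` and is cheaper whatever the (positive) costs and class weights.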

2.3 Replacing the Accuracy by a Single Criterion 

The question of whether the accuracy can be replaced by a single criterion instead 
of two (such as in ROC) has been raised in [6]. Some researchers [6] propose the 
use of the following criterion: $(1 - FPR) \times TPR$. A geometric interpretation of
the criterion is the following [6]: it corresponds to the area of a rectangle whose
opposite vertices are (FPR, TPR) and (1, 0). The typical isoperformance curve
is now a hyperbola. The third problem we address is therefore:

Problem 3: Given $\gamma$, can we find an algorithm returning a set of Horn
clauses such that $(1 - FPR) \times TPR \ge \gamma$, if such a hypothesis exists?

3 Introduction to the Proof Technique 

We present here the basic ILP notions which we use, with a basic introduction 
to our proofs. Technical parts are proposed in two appendices. 



3.1 ILP Background Needed 

The ILP background needed to understand this article can be summarized as
follows. More formalization and details are given in [4], but they are not needed
here. Given a Horn clause language $\mathcal{L}$ and a correct inference relation on $\mathcal{L}$,
an ILP learning problem can be formalized as follows. Assume a background
knowledge $\mathcal{BK}$ expressed in a language $\mathcal{LB} \subseteq \mathcal{L}$, and a set of examples $\mathcal{E}$ in
a language $\mathcal{LE} \subseteq \mathcal{L}$. The goal is to produce a hypothesis $h$ in a hypothesis
class $\mathcal{H} \subseteq \mathcal{L}$ consistent with $\mathcal{BK}$ and $\mathcal{E}$ such that $h$ and the background knowledge
cover all positive examples and none of the negative ones. Sometimes the
formalism cannot correctly classify all examples according to the preceding scenario,
for the reason that the examples describe a complex concept. We may then
transform the ILP learning problem into a relaxed version, where we want the
formulae to make sufficiently small errors over the examples. The choice of the
representation languages for the background knowledge and the examples, and
of the inference relation, greatly influences the complexity (or decidability) of the
learning problem. A common restriction for both $\mathcal{BK}$ and $\mathcal{E}$ is to use ground
facts. As in [5], we use $\theta$-subsumption as the inference relation (a clause $h_1$ $\theta$-subsumes
a clause $h_2$ iff there exists a substitution $\theta$ such that $h_1\theta \subseteq h_2$ [5, 4]).
In order to treat our problem as a classical ML problem, we use the following
lemma, which authorizes us to create ordinary examples:

Lemma 1. [5] Learning a Horn clause program from a set of ground background
knowledge $\mathcal{BK}$ and ground examples $\mathcal{E}$, the inference relation being generalized
subsumption, is equivalent to learning the same program with $\theta$-subsumption,
and empty background knowledge and examples defined as ground Horn clauses
of the form $e \leftarrow b$, where $e \in \mathcal{E}$ and $b \in \mathcal{BK}$.

In the following, we are interested in learning concepts in the form of (sets of)
non-recursive Horn clauses. It is important to note that all results are still valid
when considering propositional, determinate or local Horn clauses, similarly to
the study of [4], to which we refer for all necessary definitions. For the sake of
simplicity in stating our results, we sometimes abbreviate “Function free Horn
Clauses” by the acronym “FfHC”.
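
To make the inference relation concrete, $\theta$-subsumption between function-free clauses can be tested by searching for a suitable substitution. The sketch below is a naive backtracking test (exponential in the worst case), not a procedure taken from the paper. It assumes a clause is a set of signed literals `(sign, predicate, args)`, with `"+"` for the head and `"-"` for body literals, and that variables are the terms starting with an upper-case letter; all names are hypothetical.

```python
def is_var(term):
    """Convention of this sketch: variables start with an upper-case letter."""
    return term[:1].isupper()


def match(literal, target, theta):
    """Try to extend substitution theta so that literal*theta equals target."""
    sign1, pred1, args1 = literal
    sign2, pred2, args2 = target
    if (sign1, pred1, len(args1)) != (sign2, pred2, len(args2)):
        return None
    new_theta = dict(theta)
    for t, u in zip(args1, args2):
        if is_var(t):
            if new_theta.setdefault(t, u) != u:  # variable already bound elsewhere
                return None
        elif t != u:  # constants must coincide
            return None
    return new_theta


def theta_subsumes(h1, h2):
    """True iff some substitution theta maps every literal of h1 into h2."""
    literals = sorted(h1)

    def search(i, theta):
        if i == len(literals):
            return True
        for target in h2:
            new_theta = match(literals[i], target, theta)
            if new_theta is not None and search(i + 1, new_theta):
                return True
        return False

    return search(0, {})


# q(X) <- a1(X), t(X) against two ground examples.
h = {("+", "q", ("X",)), ("-", "a1", ("X",)), ("-", "t", ("X",))}
e1 = {("+", "q", ("l1",)), ("-", "a1", ("l1",)), ("-", "a2", ("l1",)), ("-", "t", ("l1",))}
e2 = {("+", "q", ("m1",)), ("-", "a2", ("m1",)), ("-", "t", ("m1",))}
```

Here `h` subsumes `e1` (with `X` bound to `l1`) but not `e2`, whose body lacks `a1`; the empty clause subsumes every example.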

3.2 Basic Tools for the Hardness Results 

Concerning problem 1, fix $a \in \{1, 2, 3, 4, 5, 6, 7\}$. We want to approximate the
best concept in $\mathcal{H}_{(D,)a}(\zeta)$ by one still in $\mathcal{H}_{(D,)a}(\zeta)$. However, the best concept
in $\mathcal{H}_{(D,)a}(\zeta)$ generally does not have an error equal to the optimal one over $\mathcal{H}$
given $D$, $opt_{\mathcal{H},D}(c)$. In fact, it has an error that we can denote
$opt_{\mathcal{H}_{(D,)a}(\zeta),D}(c) = \min_{h' \in \mathcal{H}_{(D,)a}(\zeta)} \sum_{h'(x) \neq c(x)} D(x) \ge opt_{\mathcal{H},D}(c)$.
The goodness of the accuracy of a
concept taken from $\mathcal{H}_{(D,)a}(\zeta)$ should be appreciated with respect to this latter
quantity. Our results on problem 1 are all obtained by showing the hardness of
solving the following decision problem:

Definition 1. Approx-Constrained($\mathcal{H}$, $(a, \zeta)$):

Instance: A set of negative examples $S^-$, a set of positive examples $S^+$, a
rational weight $0 < w(x_i) = n_i/d_i < 1$ for each example $x_i$, a rational $0 < \gamma < 1$.
We assume that $\sum_{x \in S^+ \cup S^-} w(x) = 1$.

Question: $\exists h \in \mathcal{H}_{(D,)a}(\zeta)$ satisfying $\sum_{h(x) \neq c(x)} w(x) \le \gamma$?

Define as $n_e$ the size of the largest example we dispose of. Note that when the
constraint is too tight, it can be the case that $\mathcal{H}_{(D,)a}(\zeta) = \emptyset$. Define as $|h|$ the
size of some $h \in \mathcal{H}$ (in our case, it is the number of Horn clauses of $h$). In the non-empty
subset of $\mathcal{H}$ where formulae are the most constrained (i.e. strengthening
further the constraint gives an empty subset), define Uopt^^ ^ as the size of 

the smallest hypothesis. Then, our reductions all satisfy ^ < {neY- 

Note that the constraint generally makes $opt_{\mathcal{H}_{(D,)a}(\zeta),D}(c) > opt_{\mathcal{H},D}(c)$. However,
the reductions all satisfy $d_\nu\left(opt_{\mathcal{H},D}(c), opt_{\mathcal{H}_{(D,)a}(\zeta),D}(c)\right) = o(1)$, i.e. asymptotic
values coincide. In addition, the principal result we get (similar for all other
problems) is that we can suppose that the whole time used to write the total set
of Horn clauses is assimilated to $O(n_e)$, for any set. By writing time, we mean the
time of any procedure consisting only in writing down clauses. Examples of such
a procedure are “write down all clauses having k literals”, or even “write down
all Horn clauses”. Such procedures can be viewed as for-to, or repeat algorithms.
This property authorizes the construction of Horn clause sets having arbitrary
sizes, even exponential. Problem 2 is addressed by studying the complexity of
the following decision problem.

Definition 2. Approx-Constrained-ROC($\mathcal{H}$, $\gamma_{FPR}$, $\gamma_{TPR}$):

Instance: A set of negative examples $S^-$, a set of positive examples $S^+$, a
rational weight $0 < w(x_i) = n_i/d_i < 1$ for each example $x_i$, a rational $0 < \gamma < 1$.
We assume that $\sum_{x \in S^+ \cup S^-} w(x) = 1$.

Question: $\exists h \in \mathcal{H}$ satisfying $1 - FPR \ge 1 - \gamma_{FPR}$ and $TPR \ge \gamma_{TPR}$?

Concerning problem 3, the reductions study a single replacement criterion $\Gamma$,
and the following decision problem.

Definition 3. Approx-Constrained-Single($\mathcal{H}$, $\Gamma$, $\gamma$):

Instance: A set of negative examples $S^-$, a set of positive examples $S^+$, a
rational weight $0 < w(x_i) = n_i/d_i < 1$ for each example $x_i$, a rational $0 < \gamma < 1$.
We assume that $\sum_{x \in S^+ \cup S^-} w(x) = 1$.

Question: $\exists h \in \mathcal{H}$ satisfying $\Gamma(h) < \gamma$?

4 Hardness Results 

Theorem 1. We have:

[1] $\forall\, 0 < \zeta < 1$, Approx-Constrained(FfHC, $(1, \zeta)$) is Hard, when $\nu < (1 - \zeta)/\zeta$.

[2] $\forall\, 0 < \zeta <$ , Approx-Constrained(FfHC, $(2, \zeta)$) is Hard.

[3] $\forall a \in \{3, 4, 5, 6, 7\}$, $\forall\, 0 < \zeta < 1$, Approx-Constrained(FfHC, $(a, \zeta)$) is Hard.

At that point, the notion of “hardness” needs to be clarified. By “Hard” we
mean “cannot be solved in polynomial time under some particular complexity
assumption”. The notion of hardness used encompasses that of classical NP-completeness,
since we use the results of [2] involving randomized complexity
classes. All our hardness results are to be read with that precision in mind.

Due to space constraints, only the proof of point [1] is presented in appendix 2;
all other results use strictly the same type of reduction. Also, in appendix 1,
we sketch the proof that all distributions under which our negative results are
proven lead to trivial positive results for the same problem when we remove the
additional constraint and optimize the accuracy alone. While negative results
for optimizing the accuracy itself would naturally hold when considering the
additional constraints, we therefore prove that optimizing the accuracy under
constraint is a strictly more difficult problem, with non-trivial additional drawbacks.
Furthermore, the upper bound on the error ($\gamma$ in def. 1) in constraints 4,
5, 6, 7 can be fixed arbitrarily in $]0, 1/2[$, i.e. requiring the set of Horn clauses to
perform slightly better than the unbiased coin does not make the problem easier.
We now show that the classical ROC components as described by [7] lead to the
same results as those we claimed for the preceding bi-criteria optimizations. The
problem is all the more difficult as the difficulty appears as soon as we choose
to use ROC analysis, and is not a function of the ROC bounds.

Theorem 2. Approx-Constrained-ROC(FfHC, $\gamma_{FPR}$, $\gamma_{TPR}$) is Hard; the result
holds $\forall\, 0 < \gamma_{FPR}, \gamma_{TPR} < 1$.

The distribution under which the negative result is proven is an easy distribution 
for the accuracy’s optimization alone, similarly to those of the seven constraints. 
We now investigate the replacement of the accuracy by a single criterion. The 
negative result stated in the following theorem is to be read with all additional 
drawbacks mentioned for the previous theorems. Again, the distribution under 
which the theorem is proven is easy when optimizing the accuracy alone. 

Theorem 3. $\exists \gamma_{max} > 0$ such that $\forall\, 0 < \gamma < \gamma_{max}$, the problem
Approx-Constrained-Single(FfHC, $(1 - FPR) \times TPR$, $\gamma$) is Hard.

(Proof sketch included in appendix 2). As far as we know, ^max > 4 ilie (roughly 
4.2 X 10“^), but we think that this bound can be much improved. 

5 Appendix 1: The Global Reduction 

Reductions are achieved from the NP-Complete problem “Clique” [2], whose
instance is a graph $G = (X, E)$ and an integer $k$. The question is
“Does there exist a clique of size $\ge k$ in $G$?”. Of course, “Clique” is not hard to
solve for every value of $k$. The following theorem establishes values of $k$ for which
we can suppose that the problem is hard to solve ($\binom{n}{k} = n!/((n - k)!\,k!)$ is the
binomial coefficient):

Theorem 4. (i) We can suppose that $\binom{k}{2} \le |E|$, and $k$ is not a constant, otherwise
“Clique” is polynomial. (ii) For any $\alpha \in\, ]0, 1[$, “Clique” is hard for the
value $k = \alpha|X|$ or $k = |X|^\alpha$.

Proof. (i) is immediate; (ii) follows from [2]: it is proven that the largest clique
size is not approximable to within $|X|^\beta$, for any constant $0 < \beta < 1$. Therefore,
the graphs generated have a clique number which is either I, or greater than 
I X |A|^, with I < Therefore, the decision problem is intractable for 

values of $k > I$, which is the case if $k = \alpha|X|$ or $k = |X|^\alpha$, with $\alpha \in\, ]0, 1[$. □

The structure of the examples is the same for any of our reductions. Define
a set of $|X|$ unary predicates $a_1(.), \ldots, a_{|X|}(.)$, in bijection with the vertices of
$G$. To this set of unary predicates, we add two unary predicates, $s(.)$ and $t(.)$.
The inferred predicate is denoted $q(.)$. The choice of unary predicates is made
only for a simplicity purpose. We could have replaced each of them by ?-ary 
predicates without changing our proof. Define a set of constant symbols useful 
for the description of the examples: G if} U {^i, ^2, ^3, ^4} U {mi, Vi G 

{1,2,..., |ff|}}. Examples are described in the following way. Positive examples 



from S~^ are as follows: 

Pi = <l{h) <— Afeg{i_2,...,|x|}Ofc(^i) A t{li) (2) 

P2 = qih) ^ aiih) (3) 

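Negative examples from $S^-$ are as follows:

$\forall i \in \{1, 2, \ldots, |X|\}$, $n_i = q(m_i) \leftarrow \bigwedge_{k \in \{1, 2, \ldots, |X|\} \setminus \{i\}} a_k(m_i) \wedge t(m_i)$ (4)
$n'_1 = q(l_3) \leftarrow \bigwedge_{k \in \{1, 2, \ldots, |X|\}} a_k(l_3) \wedge s(l_3) \wedge t(l_3)$ (5)
$n'_2 = q(l_4) \leftarrow \bigwedge_{k \in \{1, 2, \ldots, |X|\}} a_k(l_4) \wedge s(l_4)$ (6)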



It comes that nopt^^^ ^ (^^(c) = 0{\X\^) (coding size of positive examples) and 

$n_e = O(|X|)$. Non-uniform weights are given to each example, depending on the
constraint to be tackled. The common point to all reductions is that the
weights of all examples $n_j$ (resp. all $p_{i,j}$) are equal (resp. to $w^-$ and $w^+$). In
each reduction, examples and clauses satisfy:

H1 $p_2$ is forced to be badly classified.

H2 $n'_1$ is always badly classified.

H3 $w(n'_2)$ ensures that $n'_2$ is always given the right class, forcing any clause to
contain literal $t(.)$ (when we remove $n'_2$, we ensure that $p_2$ is removed too).

Lemma 2. Any clause containing literal $s(.)$ can be removed.

Proof. Suppose that one clause contains $s(.)$. Then it can be $\theta$-subsumed by $n'_1$
and by no other example (even if $n'_2$ exists, because of H3); but $n'_1$ $\theta$-subsumes
any clauses and also the empty clause. Therefore, removing the clause does not
modify the value of any criterion based on the example weights. Concerning the
sixth constraint, the fraction of predicates used after removing the clause is at
most the one before; thus, if the clause is an element of $\mathcal{H}_6(\zeta)$ before, it is still
an element after. The same remark holds for the seventh constraint. □

As a consequence, $p_1$ is always given the positive class (even by the empty
clause!). We now give a general outline of the proof for Problem 1; reductions
are similar for the other problems. Given $h = \{h_1, h_2, \ldots, h_l\}$ a set of Horn clauses,
we define the set $I = \{i \in \{1, 2, \ldots, |X|\} \mid \exists j \in \{1, 2, \ldots, l\}, a_i(.) \notin h_j\}$, and we fix
$|I| = k'$. In our proofs, we define two functions taking rational values, $E(k')$ and
$F_a(k')$ ($k' \in \{1, 2, \ldots, |X|\}$, $a = 1, 2, 3, 4, 5, 6, 7$). They are chosen such that:

- $E(k')$ is strictly increasing, $\sum_{x \in S^+ \cup S^- \mid h(x) \neq c(x)} w(x) \ge E(k')$, and $E(k) = \gamma$.

- $F_a(k')$ is strictly decreasing, is a lower bound of the function inside $\mathcal{H}_{(D,)a}(\zeta)$,
and $F_a(k) = \zeta$ (except for $a = 3$, where the value is $1/\zeta$).

$\forall a \in \{1, 2, 3, 4, 5, 6, 7\}$, if there exists an unbounded set of Horn clauses $h \in \mathcal{H}_{(D,)a}(\zeta)$
satisfying $\sum_{(x \in S^+ \wedge h(x)=0) \vee (x \in S^- \wedge h(x)=1)} w(x) \le \gamma$, its error rate implies
$k' \le k$ and the constraint implies $k' \ge k$. So $|I| = k' = k$. The interest of the
weights is then to force $\binom{k}{2}$ positive examples from the set $\{p_{i,j}\}_{(i,j) \in E}$ to be well
classified, while we ensure the misclassification of at most $k$ negative examples
of the set $\{n_i\}_{i \in \{1, 2, \ldots, |X|\}}$. It follows that these $\binom{k}{2}$ examples correspond to the
$\binom{k}{2}$ edges linking the $|I| = k$ vertices corresponding to negative examples badly
classified. We therefore dispose of a clique of size $\ge k$.
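
The last step of this outline, checking that the $k$ vertices singled out by the misclassified negative examples are pairwise linked, is an ordinary clique test. The sketch below illustrates it together with a brute-force version of the “Clique” decision problem; it is exponential in $k$ and is meant only to make the objects concrete. Vertices are assumed to be numbered from 1 to $|X|$, and the function names are hypothetical.

```python
from itertools import combinations
from math import comb


def is_clique(vertices, edges):
    """True iff every pair of the given vertices is joined by an edge."""
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset(pair) in edge_set for pair in combinations(vertices, 2))


def has_clique(n_vertices, edges, k):
    """Brute-force decision of "Clique": is there a clique of size k?"""
    distinct_edges = {frozenset(e) for e in edges}
    if comb(k, 2) > len(distinct_edges):  # a k-clique needs C(k, 2) edges
        return False
    candidates = combinations(range(1, n_vertices + 1), k)
    return any(is_clique(c, edges) for c in candidates)


# A triangle on {1, 2, 3} with a pendant vertex 4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
```

The early exit on `comb(k, 2)` mirrors point (i) of Theorem 4: an instance with fewer than $\binom{k}{2}$ edges can be answered immediately.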

Conversely, Va G (1, 2, 3, 4 , 5, 6 , 7}, given some clique of size k whose set of ver- 
tices is denoted X, we show that singleton h = q{X) ^ A 

t{X) is G 7f(D,)a(C), satisfying 'Z(x^s+ Ah(x)=o)^{x<^s- Ah{x)=i)w{x) < 7- In this 
case, ^ drops down to 0(ne). 

All distributions used in theorems 1 and 3 are such that $w^+ < w^-/|X|$, at least
for graphs exceeding a fixed constant size. Also, due to the negative examples of
weights $w^-$, if we remove the additional constraints and optimize the accuracy
alone, we can suppose that the optimal Horn clause is a singleton: merging all
clauses by keeping over predicates $a_j(.)$ only those present in all clauses does
not decrease the accuracy. Under such a distribution, the optimal Horn clause
necessarily contains all predicates $a_j(.)$, and the problem becomes trivial. The
distribution in theorem 2 satisfies $w^+ = w^-$. This is also a simple distribution
for the accuracy's optimization alone: indeed, the optimal Horn clause over predicates
$a_j(.)$ is such that it contains no predicate $a_j(.)$ that does not appear in at
least one positive example. If the graph instance of “Clique” is connected (and
we can suppose so, otherwise the problem boils down to finding the largest clique
in one of the connected components), then the optimal Horn clause does not
contain any of the $a_j(.)$.



6 Appendix 2: Proofs of Negative Results 
6.1 Proof of Point [1], Theorem 1

Weights of positive examples: w{p 2 ) = 2 ( 1 ^^) + |Alpw“(l -|- ^)); G E, 



i^ + |A|-A: 



'^{Pi,j) = w+ = (\x\+ky ^ ; '“^(Pi) = ~ w) ~ , j, 

\ (l-E'l ~ (2)) + 1^1 )■ Weights of negative examples: 10(712) = 1/2; 

Vj G {1,2,..., |A|},w(nj) =w~ = w(n'i) = 5 (|-U| - ( 2 )) + 

i ((|A |2 - k)w~). 

Fix 7 = {w(p2) + w(n'i) + kw~ + ^|U| — (2))) /2 (note that 10(112) ensures that 
n'2 is given the right class), and fc^ax = 1 + max2^j,„^|2j-|.|£.|_^fc”^>P fc". From
the choice of weights, lcm(Ua;.gs+us-di) = 0(|X|®) (“lcm” is the least common
multiple), which is polynomial. Define the functions: Vfc' G { 0 , 1 }, =

\E\w'^+k'w~+w{p2)+w{ni)\ V 2 < fc' < fcmax, E{k') = ^|if| — (2)^ w'^+k'w~ + 

w{p2) + w{ni)\ V/cmax < k' < \X\,E{k') = k'w~ +w{p2) + w{ni). From the choice 
of weights, E{k) = 7. \/k' G { 0 , 1 }, F’i(fc') = — k'w~ + w{p2) — w{ni)\/q] 

\/2 < k' < fcmax, = I (^\E\ - (2)) - k'w~ + w{p 2 ) ~ w{ni)\/q\ Vfcmax < 

k' < |X|, Fi(fc') = I — k'w~ + w{p2) — w{ni)\/q, with q = y + \E\w'^ + k'w~ + 
w{p2) + w{ni). From the choice of weights, Fi{k) = 

The equation obtained when k' < fcmax takes its maximum for integer val- 
ues when fc' = (|X| -I- fc)^ -I- 0,5 ± 0,5 > |X|. Furthermore, VI < fcmax < 

|X|, ^|if| - (^““2”^)) , which leads to if(fcmax - 1) < £^(fcmax)- In 

a more general way, $E(k')$ is strictly increasing over natural integers. Now
remark that the numerator of $F_1(k')$ is strictly decreasing, and its denominator
strictly increasing. Therefore, $F_1(k')$ is strictly decreasing. Furthermore
$d_\nu\left(\sum_{h(x) \neq 1 = c(x)} w(x), \sum_{h(x) \neq 0 = c(x)} w(x)\right) \ge F_1(k')$. If $\exists h \in \mathcal{H}_{D,1}(\zeta)$ satisfying
$\sum_{h(x) \neq c(x)} w(x) \le \gamma$, the error rate implies $k' \le k$ and the constraint implies
$k' \ge k$. Thus $|I| = k' = k$. As pointed out in the preceding appendix, this leads
to the existence of a clique of size $\ge k$.

Reciprocally, the Horn clause h constructed in Appendix 1 satisfies both relations 
h G and T.h{x)^c(x)w{x) < 7. Indeed, we have Y.h(x)^i=c(x)w{x) = 

(1^1 “ (2)) + w{p2), but also Y.h(x)^o=c(x) = kw~ +w(ni). Therefore, 

(Eh(x)^i=c(x) w(x),Eh(x)^o=c(x) ^(^)) = ^i(^) = C and fc G H{^7,i(C). We 
have also w{x) = F{k) = 7. The reduction is achieved. We end by re- 
marking that i(c)(c)^ < (|A|ui+ -k \X\w~)/ + v^, 

which is 0(1) (as |A| — > 00 or \E\ 00), as claimed in subsection I.S. 2 I 



6.2 Proof Sketch of Theorem 3



Remark that $TPR(1 - FPR) = TPR \times TNR$. Weights are as follows for positive
examples (we do not use $p_2$):



V(i,j) G E,w{pij) = = 



7 



{\X\-k)w- X 



(k\ (|X| + l) = -(fe-I^) -3|X| 

I2/ 6 



w{pi) = w'^ X ^(|A| + 1 )^ — ^fc — — 3 |A|^ /6. Weights are as follows 

for negative examples (we do not use $n'_2$): $\forall j \in \{1, 2, \ldots, |X|\}$, $w(n_j) = w^- = 1/(|X| + k)$;
$w(n'_1) = 1 - |E|w^+ - |X|w^- - w(p_1)$. The choice of $\gamma_{max}$ comes
from the necessity of keeping weights within correct limits. We explain how to prove the
existence of a clique, by describing a polynomial of degree 3, $F(k')$, which upper-bounds
$TPR \times TNR$, and of course has the desirable property of having its maximum
for $k' = k$, with value $\gamma$, and with no other equal or greater values on the interval
$[0, |X|]$. Similarly to the other proofs, the value $\gamma$ can only be reached when
$k' = k$ represents $k$ “holes” among predicates $\{a_j(.)\}$, and this induces a size-$k$
clique in the graph. Define the function $F(k')$ as follows. $\forall k' \in \{0, 1\}$, $F(k') = w(p_1) \times (|X| - k')w^-$;
$\forall\, 2 \le k' \le k_{max}$, $F(k') = \left(\binom{k'}{2} w^+ + w(p_1)\right)(|X| - k')w^-$;
$\forall\, k_{max} < k' \le |X|$, $F(k') = (|E|w^+ + w(p_1))(|X| - k')w^-$. With our choice of
weights, and inside the values of $k'$ for which we described $k$ (clearly, in the
second $F(k')$), $F$ describes a polynomial of degree 3, shown in figure 2.

Fig. 2. Scheme of $F(k') = \left(\binom{k'}{2} w^+ + w(p_1)\right)(|X| - k')w^-$.

$F$ upper-bounds $TPR \times TNR$ of any set of Horn clauses, and the demand on $TPR \times TNR$
leads to a single favorable case: the “holes” inside the set of Horn clauses describe
a clique of size $k' = k$ in the graph.

Abstract. This paper proposes a new framework of knowledge revision,
called Similarity-Driven Knowledge Revision. Our revision is invoked
based on a similarity observation by users and is intended to match with
the observation. Particularly, we are concerned with a revision strategy
according to which an inadequate variable typing in describing an
object-oriented knowledge base is revised by specializing the typing to
a more specific one without loss of the original inference power. To realize
it, we introduce a notion of extended sorts that can be viewed as a concept
not appearing explicitly in the original knowledge base. If a variable typing
with some sort is considered over-general, the typing is modified by
replacing it with a more specific extended sort. Such an extended sort can
efficiently be identified by forward reasoning with SOL-deduction from
the original knowledge base. Some experimental results show the use of
SOL-deduction can drastically improve the computational efficiency.



1 Introduction 

This paper proposes a novel method of knowledge revision (KR), called Similarity-Driven
Knowledge Revision. In traditional KR methods previously proposed
(e.g. P3), the revision process is triggered by some examples logically inconsistent
with an original knowledge base. Then the knowledge base should be
revised so that its extension (defined by a set of facts derived from it) becomes
consistent with the examples. As a result, the inconsistency in the knowledge
base will be removed.

Although we can have several revision methods to remove the conflict as a 
logical inconsistency, a minimal revision principle is especially preferred. Here 
the principle implies a strategy for selecting a new knowledge base that has a 
minimal extension consistent with the examples. Thus, both the notion of con- 
flicts in knowledge bases and the basic revision strategy have been investigated 
in terms of extensions. 

However, even if we find no extensional conflict in our knowledge base, we
might need to revise it from a non-extensional viewpoint. We intend to utilize
the revision from the non-extensional viewpoint to make our revision faster in the
presence of extensional conflicts and to do revision even for cases not involving
them.

[Figure: the sorts male and female, with subconcepts male_with_long_hair (male ∧ has_long_hair) and female_with_long_hair (female ∧ has_long_hair)]

Fig. 1. Extensional Relationship between male and female



As a first approach to such a non-extensional revision, this paper considers
a conflict caused by an over-general typing of variables in an object-oriented
knowledge base. When the knowledge base contains an over-general typing of
some variable, then it deduces negative facts that will eventually be observed.
However, this is not the only way to find the over-general typing. This paper proposes
to use a notion of similarities between types to find the inappropriateness, and
tries to present a revision method based on it. As a type becomes more general,
we have more chances to observe properties and rules shared by another type.
Thus the possible class of similarities between types will contain a similarity that
does not meet a user's intention.

Suppose we have a method for finding a similarity $\varphi$ for a given knowledge
base. There exist two cases where the similarity does not fit a user's intention: 1)
$\varphi$ shows the similarity between types $s_1$ and $s_2$, while the user considers they are
not similar and 2) $\varphi$ shows the dissimilarity between $s_1$ and $s_2$, while the user
considers they are similar. This paper is especially concerned with a knowledge
revision in the former case, assuming GDA, Goal-Dependent Abstraction, as an
algorithm to find the similarities between types under an order-sorted logic,
where the types are called sorts in the logic.

Given a knowledge base $\mathcal{KB}$ and a goal $G$ to be proved, GDA detects an
appropriate similarity for $G$ between sorts $s_1$ and $s_2$ reflected in $\mathcal{KB}$ if the same
property relevant to $G$ is shared by both $s_1$ and $s_2$. For example, suppose we
have knowledge “if female has a property has_long_hair, then female has a
property takes_long_time_shower” and “if male has the property has_long_hair,
then male has the property takes_long_time_shower”. It can be represented as
the following order-sorted clauses:

“takes_long_time_shower(X : female) ← has_long_hair(X : female)” and

“takes_long_time_shower(X : male) ← has_long_hair(X : male)”.

In these clauses, the description “X : s” is called a variable typing and means that
the range of the value is restricted to an object (constant) belonging to the sort s.
For the knowledge, GDA detects a similarity between female and male with respect
to the property “takes_long_time_shower”. However, this similarity seems
not to fit our intuition very well. Such an unfitness would come from the fact
that “has_long_hair” would be considered as a feasible feature for female, while
not so for male, as illustrated in Fig. 1. This implies that the variable typing
X : male is inappropriate in the sense of being over-general. We consider such a wrong
typing as an intensional conflict in the original knowledge base and try to resolve
it. The wrong typing X : male is specialized (revised) by finding a new sort concept
s' that is a subconcept of male and has the concept “male_with_long_hair”
as its subconcept, as shown in the figure. In a word, such a new s' is an imaginable
subconcept of male that has the property “has_long_hair” as its feasible feature,
like female has. For the revised knowledge base, GDA detects a similarity
between s' and female that seems to fit our intuition.

Thus, our revision is triggered when we recognize an unfitness of a detected
similarity. Therefore, we call it Similarity-Driven Knowledge Revision.

A newly introduced imaginable concept is defined as an extended sort. An
extended sort denotes a concept that does not appear explicitly in the original
knowledge base. If a variable typing X : s is considered over-general, the typing
is modified by replacing the general sort s with a more specific extended sort
s'. In order for the extended sort to be meaningful, we present four conditions
to be satisfied. It is theoretically shown that such an extended sort can be
identified by a forward reasoning from the original knowledge base. Especially, a
forward reasoning with SOL-deduction is adopted for an efficient computation.
Some experimental results show that the use of SOL-deduction can drastically
improve the computational efficiency.

2 Preliminaries 

In this section, we introduce some fundamental terminologies and definitions¹.

Our knowledge base consists of the following three components.

Sort Hierarchy: A sort hierarchy is given as a partially ordered set of sort
symbols, $H = (\mathcal{S}, \preceq)$. Each sort $s \in \mathcal{S}$ denotes a sort concept and is interpreted
as the set of possible instances of the concept. The relation $s \preceq s'$ means that $s$
is a subconcept of $s'$.

Type Declaration: A type declaration, $TD$, is a finite set of typed constants
that is of the form $\{c_1 : s_1, \ldots, c_n : s_n\}$, where $c_i$ is a constant symbol denoting
an object and $s_i$ is a sort in $\mathcal{S}$. A typed constant $c_i : s_i$ declares that the object
denoted by $c_i$ primarily belongs to the sort concept $s_i$, where $s_i$ is called the
primary type (or simply, type) of $c_i$ which is referred to as $[c_i]$. The extension of
a sort is defined as follows:

Definition 1 (Extension of Sort). Let $s$ be a sort. The extension of $s$, $A_s$, is
defined as $A_s = \{c \mid [c] \preceq s\}$. We say that $c$ belongs to $s$, if $c \in A_s$.

From the interpretation, it is obvious that if $s \preceq s'$ and $[c] = s$, then $c$ belongs
to $s'$ as well as to $s$.
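
As a small illustration of these definitions, the extension of a sort can be computed from a sort hierarchy and a type declaration. The sketch below assumes the hierarchy is stored as a dictionary mapping each sort to the set of its direct super-sorts, and the type declaration as a dictionary from constants to primary types; the sort `person` and the constants `taro` and `hanako` are hypothetical and serve only as an example.

```python
def subsorts(hierarchy, s):
    """All sorts below s in the partial order (reflexive and transitive)."""
    result, frontier = {s}, [s]
    while frontier:
        current = frontier.pop()
        for child, parents in hierarchy.items():
            if current in parents and child not in result:
                result.add(child)
                frontier.append(child)
    return result


def extension(hierarchy, type_decl, s):
    """Constants whose primary type is s or a subconcept of s."""
    below = subsorts(hierarchy, s)
    return {c for c, primary in type_decl.items() if primary in below}


# Hypothetical knowledge base: male and female are subconcepts of person.
hierarchy = {"male": {"person"}, "female": {"person"}, "person": set()}
type_decl = {"taro": "male", "hanako": "female"}
```

With this data, `taro` belongs to `male` and also to `person`, in line with the remark that a constant belongs to every sort above its primary type.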

Domain Theory: A domain theory (or simply, theory), $T$, is a set of function-free
Horn clauses of the form $A \leftarrow B_1 \wedge \cdots \wedge B_n$ where $A$ and $B_i$ are positive
literals of the form $p(t_1, \ldots, t_m)$ and each term $t_i$ is a typed variable $X_i : s_i$
or a typed constant. A typed variable $X_i : s_i$ means that the range of values
is restricted to instances belonging to $s_i$. On the other hand, we assume that a
predicate $p$ can take objects of any sort concept as its arguments.

¹ Throughout this paper, we assume that the notions with respect to First-Order Logic
are familiar to the reader.

The inference under our knowledge base is the same as standard first-order
deduction except for the following point:

Every substitution $\theta = \{X_i : s_i / t_i\}$ should satisfy the type constraint
that $[t_i] \preceq s_i$. That is, the variable must be instantiated to a variable or
a constant of a more specific sort concept².



3 Goal-Dependent Abstraction 



In this section, we briefly present a framework of Goal-Dependent Abstraction
that is used as the basis of our similarity-finding algorithm GDA.



3.1 Abstraction Based on Sort Mapping 

We first introduce a framework of abstraction based on sort mapping, an extended
version of the framework of abstraction based on predicate mapping [3].

Definition 2 (Sort Mapping). Let $(S, \preceq)$ be a sort hierarchy. A sort mapping
is a mapping $\psi : S \to S'$, where $S'$ is an abstract sort set such that
$S \cap S' = \emptyset$.

A sort mapping can easily be extended to a mapping over any set of
expressions. For an expression $E$, $\psi(E)$ is defined as the expression obtained by
mapping the sort symbols in $E$ under $\psi$.

For any expressions $E_1$ and $E_2$, if $\psi(E_1) = \psi(E_2) = E'$, then $E_1$ and $E_2$ are
said to be similar and to be instantiations of $E'$ under $\psi$. Thus, a sort mapping
can be viewed as a representation of similarity between sorts. Therefore, we often
use the term "similarity" as a synonym for "sort mapping".

Definition 3 (Theory Abstraction Based on Sort Mapping). Let
$KB = \langle (S, \preceq), TD, T \rangle$ be an order-sorted (concrete) knowledge base and
$\psi : S \to S'$ be a sort mapping. The abstract domain theory of $T$ based on $\psi$,
$SortAbs_\psi(T)$, is defined as
$SortAbs_\psi(T) = \{C' \mid \forall C \in \psi^{-1}(C'),\ T \vdash C\}$, where for an expression
$E'$, $\psi^{-1}(E')$ is defined as $\psi^{-1}(E') = \{E \mid \psi(E) = E'\}$.

For example, assume we have a domain theory consisting of three clauses
(facts) $p(X : s_1)$, $p(X : s_2)$ and $q(X : s_1)$. Based on a similarity $\psi$ such that
$\psi(s_1) = \psi(s_2) = s'$, the first two clauses are preserved as the abstract
clause $p(X : s')$, while the last one is not, because $q(X : s_2)$ is not provable.
Thus, only a property such as $p$ that is shared among all similar sorts is
preserved by the abstraction process. Based on this characteristic, a dynamic
abstraction framework, called Goal-Dependent Abstraction, has been proposed.
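The three-fact example can be replayed with a minimal sketch of Definition 3. It assumes a deliberately restricted setting that is not the general definition: the theory contains only unary facts, each written as a (predicate, sort) pair, so that "provable" reduces to set membership. The function name `sort_abs` is hypothetical.

```python
from collections import defaultdict


def sort_abs(facts, psi):
    """Abstract theory for unary facts: keep (p, s') only if p holds for every
    concrete sort that psi maps to s'."""
    preimage = defaultdict(set)
    for concrete, abstract in psi.items():
        preimage[abstract].add(concrete)
    candidates = {(pred, psi[sort]) for pred, sort in facts}
    return {
        (pred, abstract)
        for pred, abstract in candidates
        if all((pred, concrete) in facts for concrete in preimage[abstract])
    }


# The facts p(X : s1), p(X : s2), q(X : s1) and the similarity psi(s1) = psi(s2) = s'.
FACTS = {("p", "s1"), ("p", "s2"), ("q", "s1")}
PSI = {"s1": "s'", "s2": "s'"}
ABSTRACT_THEORY = sort_abs(FACTS, PSI)
```

Here `ABSTRACT_THEORY` contains only the abstract fact for `p`; the fact for `q` is dropped because it is not shared by `s2`.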

^ We consider in this paper only substitutions satisfying this constraint. 
3.2 Appropriate Similarity for Goal

When we try to prove a goal from our knowledge base, the used (necessary) 
knowledge is completely dependent on the goal. Observing which properties 
(knowledge) can be preserved in the abstraction process based on a similarity, 
we define an appropriateness of similarity with respect to a given goal. 

Definition 4 (Appropriateness of Similarity for Goal). Let $T$ be a theory,
$\psi$ a similarity and $G$ a goal. Assume that $G$ can be proved from $T$ and
$Proof(G)$ is the set of clauses in $T$ that are used in a proof of $G$. The
similarity $\psi$ is said to be appropriate for $G$ if
$\psi(Proof(G)) \subseteq SortAbs_\psi(T)$.

If a similarity $\psi$ is appropriate for a goal $G$, the properties
appearing in $Proof(G)$ (that is, the properties relevant to $G$) are shared among
all similar sorts defined by $\psi$. Since such an appropriate similarity depends on
the goal we try to prove, abstraction based on an appropriate similarity is
goal-dependent.

Given a knowledge base and a goal to be proved, the algorithm GDA detects an
appropriate similarity for the goal. A detailed discussion of GDA is beyond the
scope of this paper; its precise description can be found in the literature.
In our knowledge revision, GDA is used to detect a similarity reflected in our
knowledge base. Our revision process is invoked when the detected similarity
does not fit our intuition.

4 Similarity-Driven Knowledge Revision for Type 
Specialization 

In this section, we propose a novel method of knowledge revision, called
Similarity-Driven Knowledge Revision. Before giving a formal description, we
present some assumptions imposed in the following discussion.

Our knowledge revision is triggered by the observation of an undesirable
similarity that is reflected in a given original knowledge base. For a knowledge
base and a goal $G$, assume GDA detects a similarity $\psi$ under which $s_1$ and $s_2$
are similar, that is, $\psi(s_1) = \psi(s_2)$. Assume further that a user considers
them not to be similar because the user recognizes some difference between $s_1$
and $s_2$ with respect to the plausibility of a property $p$ relevant to $G$, as
illustrated in the first section. In this case, we try to revise our knowledge
base by specializing a variable typing with $s_1$ or $s_2$ that is considered too
general with respect to the property $p$. However, since such a recognition of
over-generality seems to depend highly on the user's subjectivity, our revision
system cannot decide by itself which variable typing should be specialized.
Therefore, we assume that the variable typing to be specialized is given to the
system by the user as an input.

In our knowledge base, the occurrence of a given variable typing to be
specialized might not be identified uniquely. In that case, it would be desirable
to specialize all of the occurrences adequately. Although one might consider
specializing them individually as a naive approach, the result of one
specialization process deeply affects subsequent specializations. It is not easy
to obtain a good characterization of such a complicated revision at the present
time. In this paper, therefore, we deal only with the case where the variable
typing to be specialized can be uniquely identified.

^ Briefly speaking, in terms of the EBL method, this means that for any similar
sort, the goal $G$ can be explained in the same way (that is, with the same
explanation structure).



4.1 Extended Sort 

We first introduce a notion of extended sort that is a core concept in our revision. 

In our method, a modification of the type $s$ of a variable is realized by finding
a new sort $s'$ that is more specific than $s$. Such a newly introduced sort is
precisely defined as an extended sort.

Definition 5 (Extended Sort). Let us consider a conjunction of atoms, $es$,
and focus on a typed variable $X : s$ in $es$. Then, $es$ is called an extended sort
with the root variable $X$. The root variable is referred to as $Root(es)$.

An extended sort is interpreted as the set of possible instances of its root
variable, as defined below. In the definition, a Herbrand model of a theory $T$ is
a subset $M$ of the Herbrand base

$B = \{p(a_1, \ldots, a_n) \mid p \text{ is a predicate symbol and } a_i \text{ is a typed constant symbol}\}$

that satisfies the condition that for any clause $A \leftarrow Bs$ in $T$ and any ground
substitution $\theta$, $A\theta \in M$ whenever $Bs\theta \subseteq M$.

Definition 6 (Extension of Extended Sort). Let $T$ be a theory and $M$ be
a Herbrand model of $T$. Consider an extended sort $es$ such that $Root(es) = X$.
The extension of $es$ with respect to $M$, $E_M(es)$, is defined as

$E_M(es) = \{X\theta \mid \theta \text{ is a ground substitution to } es \text{ such that } es\theta \subseteq M\}$.

For a constant $c$, if $c \in E_M(es)$, then $c$ is said to belong to $es$.

For example, $es = has\_a\_child(X : person, Y : person) \wedge likes(Y, Z : dog)$
with the root variable $X$ is interpreted as the set of persons each of whom has a
child who likes a dog. Thus the variables other than the root variable are
existentially quantified.
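Definition 6 can be illustrated on this example with a short sketch. It assumes an invented Herbrand model and invented constants (`ann`, `bob`, `carl`, `rex`), and it supports only atoms whose arguments are typed variables; none of these names come from the paper.

```python
# Hypothetical typed constants and a flat sort hierarchy (no proper subsorts).
SUPERSORTS = {"person": set(), "dog": set()}
TYPE_DECL = {"ann": "person", "bob": "person", "carl": "person", "rex": "dog"}


def is_subsort(s, t):
    return s == t or any(is_subsort(parent, t) for parent in SUPERSORTS[s])


def substitutions(atoms, model, theta):
    """Yield each ground substitution extending theta that maps all atoms into model."""
    if not atoms:
        yield theta
        return
    pred, args = atoms[0]
    for fact_pred, values in model:
        if fact_pred != pred or len(values) != len(args):
            continue
        extended = dict(theta)
        for (var, sort), value in zip(args, values):
            # A variable may only take a constant of its own sort or a subsort,
            # and repeated occurrences of a variable must agree on the value.
            if not is_subsort(TYPE_DECL[value], sort) or extended.setdefault(var, value) != value:
                break
        else:
            yield from substitutions(atoms[1:], model, extended)


def extension(es, root, model):
    """E_M(es): values taken by the root variable over all matching substitutions."""
    return {theta[root] for theta in substitutions(es, model, {})}


# es = has_a_child(X : person, Y : person) AND likes(Y, Z : dog), with root X.
ES = [
    ("has_a_child", (("X", "person"), ("Y", "person"))),
    ("likes", (("Y", "person"), ("Z", "dog"))),
]
MODEL = {
    ("has_a_child", ("ann", "bob")),
    ("has_a_child", ("carl", "ann")),
    ("likes", ("bob", "rex")),
}
```

In this model only `ann` belongs to the extended sort: her child `bob` likes the dog `rex`, whereas `carl`'s child `ann` likes no dog. The variables `Y` and `Z` play the existential role described above.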

An ordering on extended sorts can be introduced based on their extensions. 

Definition 7 (Ordering on Extended Sorts). Let $T$ be a theory and $\mathcal{ES}$
be the set of all extended sorts. For any extended sorts $es_1$ and $es_2$ in
$\mathcal{ES}$, $es_1 \preceq_T es_2$ iff for any Herbrand model $M$ of $T$,
$E_M(es_1) \subseteq E_M(es_2)$.

Let us consider an extended sort
$es_1 = has\_son(X : father, Y : boy) \wedge dislikes(Y, X)$, whose root variable is
$X$. $es_1$ denotes fathers each of whom has a (young) son who dislikes his father.
By definition, $es_1$ is a subsort of father and parent under a sort hierarchy
involving $father \prec parent$ and $boy \prec young\_person$. As a more general
concept placed between $es_1$ and parent, we can consider fathers each of whom has
a child who dislikes his or her parent, father or mother (possibly both). This
is expressed by the extended sort
$es_2 = has\_child(X : father, W : young\_person) \wedge dislikes(W, Z : parent)$.
That $es_1$ is a special case of $es_2$ can be derived from the following knowledge:

- Domain Theory: We know $has\_child(X, Y) \leftarrow has\_son(X, Y)$. Therefore
$es_1$ is a special case of $es_a = has\_child(X : father, Y : boy) \wedge dislikes(Y, X)$.

- Sort Hierarchy: For the existentially quantified variables $Z : parent$ and
$W : young\_person$ in $es_2$ to have values, it suffices for them to have values
in their subsorts father and boy, respectively. In addition, it is possible to
identify $Z$ with $X$ (that is, $Z$ and $X$ are the same person). This is done by
a substitution $\zeta = \{Z : parent / X : father,\ W : young\_person / Y : boy\}$. Thus
any instance of $es_a$ turns out to be an instance of $es_2$.

As a result, $es_1$ is found to be more specific than $es_2$ (via $es_a$). We can
summarize this argument by the following proof-theoretic characterization of the
ordering.

Theorem 1. Let $T$ be a theory and $es_1$ and $es_2$ be extended sorts such that
$Root(es_1) = Root(es_2) = X$. Consider a ground substitution $\theta$ to $es_1$ that
substitutes no constant appearing in $T$. Then for a ground substitution $\sigma$ to
$es_2$ such that $X\theta = X\sigma$, $es_1 \preceq_T es_2$ iff $T \vdash es_2\sigma \leftarrow es_1\theta$.

4.2 Revising Knowledge Base by Specializing Type of Variable 

We describe here our specialization process for over-general variable typing. 

For a knowledge base $KB = \langle H, TD, T \rangle$ and a goal $G$, let us assume that
GDA detects a similarity $\psi$ such that $\psi(s_1) = \psi(s_2)$ and that a user
considers $s_1$ and $s_2$ not to be similar. In addition, assume the user considers a
variable typing by $s_1$ to be over-general.

Let $C = P \leftarrow ES$ be a clause in $Proof(G)$, where $ES$ is a conjunction of
atoms and contains a typed variable $X : s_1$. Here we consider $ES$ as an extended
sort whose root variable is $X$.

For an expression $E$, $E_{(t, t')}$ denotes the expression obtained by replacing
every occurrence of $t$ in $E$ with $t'$. Since GDA detects the similarity between
$s_1$ and $s_2$, it follows from the definition of appropriateness that the clause
$C_{(X:s_1, X:s_2)} = P_{(X:s_1, X:s_2)} \leftarrow ES_{(X:s_1, X:s_2)}$ is true (provable)
under the original knowledge base. Nevertheless, this similarity does not fit
the user's intuition.

This mismatch would arise in a situation such as the one illustrated in Fig. 1.
In this example, the objects of $s_1$ satisfying $ES$ are a small part of $s_1$, while
those of $s_2$ satisfying $ES_{(X:s_1, X:s_2)}$ make up most of $s_2$. In short, the
objects of $s_2$ satisfying $ES_{(X:s_1, X:s_2)}$ would be considered typical, while
those of $s_1$ satisfying $ES$ would not. In order for GDA not to detect a
similarity between them, we introduce a new sort $s_1'$ which is a subsort of $s_1$
and subsumes $ES$ (refer to Fig. 1). Then, the original clause $C$ is modified into
a new clause $C_{(X:s_1, X:s_1')} = P_{(X:s_1, X:s_1')} \leftarrow ES_{(X:s_1, X:s_1')}$.
Furthermore, the newly introduced sort is inserted at an adequate position in
the original sort hierarchy and the original type declaration is modified. As
a result, for the revised knowledge base, GDA detects a similarity between $s_1'$
and $s_2$ instead of one between $s_1$ and $s_2$. It should be emphasized here that
since $s_1'$ subsumes the extended sort $ES$, the extension of the original knowledge
base is not affected by this revision. That is, our revision can be considered
minimal. Below we discuss our specialization process in more detail.
Identifying Newly Introduced Sort: The newly introduced sort is precisely
defined as an extended sort that is identified according to the following
criterion.

Definition 8 (Appropriateness of Introduced Extended Sort). If an extended
sort $es$ satisfies the following conditions, then it is considered an
appropriate extended sort to be introduced:

C1: $ES \preceq_T es$.

C2: For any answer substitution $\theta$ to $es$ with respect to the knowledge
base $KB$, there exists a term $t$ such that $X : s_1 / t \in \theta$.

C3: $es \cap ES = \emptyset$.

C4: For any $es_0$ such that $es_0 \subseteq es$ and any selection of root variables in
$P$ and $es_0$, $P \not\preceq_T es_0$.

By condition C1, it is guaranteed that the revision has no influence on
the extension of the original knowledge base. C2 is required in order for $s_1'$ to
become a proper subsort of $s_1$. C3 and C4 are imposed to remove redundant
descriptions.

An appropriate extended sort is basically computed in a generate-and-test
manner. According to Theorem 1, forward reasoning from $T \cup \{ES\theta\}$ is
performed to obtain a candidate $es$ such that $T \cup \{ES\theta\} \vdash es\sigma$. Then the
candidate is tested for its appropriateness based on the conditions C2, C3 and
C4. If the candidate satisfies those conditions, the newly introduced sort $s_1'$
is defined by $es$. Based on the definition of $es$, the user may assign an adequate
sort symbol to $s_1'$ if he or she prefers.

For an efficient computation of $es$, we present a useful theorem and propose
adopting a reasoning method, SOL-deduction [7].

Theorem 2. Let $C = P \leftarrow ES$ be the clause in $Proof(G)$ to be replaced in the
revision, and $es$ be an extended sort such that $Root(es) = Root(ES) = X : s$.
If $es$ satisfies C1, C2 and C4, then $(T - \{C\}) \cup \{ES\theta\} \vdash es\sigma$ and
$T - \{C\} \nvdash es\sigma$, where $\theta$ is a ground substitution to $ES$ that substitutes
no constant appearing in $T$ and $\sigma$ is a ground substitution to $es$ such that
$X\theta = X\sigma$.

From the theorem, it is sufficient for our candidate generation to perform
forward reasoning from $(T - \{C\}) \cup \{ES\theta\}$, instead of from $T \cup \{ES\theta\}$.
Since removing $C$ from $T$ reduces the cost of the forward reasoning, we can
expect more efficient candidate generation.

In addition, the theorem says that any candidate derivable from $T - \{C\}$ alone
is useless for obtaining an appropriate extended sort. That is, it is sufficient
to have the candidates that become newly derivable by adding $ES\theta$ to
$T - \{C\}$. If we have a method that obtains such candidates efficiently, the
desired extended sorts can be obtained more efficiently. Such a method is
available with the help of SOL-deduction [7]. Briefly speaking, for a theory $T$
and a clause $C$, by performing SOL-deduction from $T \cup \{C\}$, we can directly
obtain the clauses that become newly derivable by adding $C$ to $T$. This
characteristic of SOL-deduction is quite helpful for our task of candidate
generation. However, it should be noted that we still have to test for
appropriateness based on C2, C3 and C4.
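The candidate generation step can be sketched with plain bottom-up forward chaining. This is a naive stand-in written for illustration, not SOL-deduction, and it makes several assumptions: only the rule part of the cup-and-can theory of Fig. 3 is encoded (the non-ground facts are omitted), the seed atom plays the role of $ES\theta$ with fresh constants `a` and `b`, and all function names are hypothetical.

```python
SUPERSORTS = {"cup": set(), "can": set(), "metal": {"material"}, "material": set()}


def is_subsort(s, t):
    return s == t or any(is_subsort(parent, t) for parent in SUPERSORTS[s])


def bind(args, values, theta, types):
    """Extend theta so the typed variables in args match the ground values, else None."""
    theta = dict(theta)
    for (var, sort), value in zip(args, values):
        if sort is not None and not is_subsort(types[value], sort):
            return None
        if theta.setdefault(var, value) != value:
            return None
    return theta


def solutions(body, atoms, theta, types):
    if not body:
        yield theta
        return
    pred, args = body[0]
    for fact_pred, values in atoms:
        if fact_pred == pred and len(values) == len(args):
            extended = bind(args, values, theta, types)
            if extended is not None:
                yield from solutions(body[1:], atoms, extended, types)


def forward_closure(rules, seed, types):
    """All ground atoms derivable from the seed atoms with the given rules."""
    atoms, changed = set(seed), True
    while changed:
        changed = False
        for (head_pred, head_args), body in rules:
            for theta in list(solutions(body, atoms, {}, types)):
                derived = (head_pred, tuple(theta[var] for var, _ in head_args))
                if derived not in atoms:
                    atoms.add(derived)
                    changed = True
    return atoms


def new_atoms(rules, seed, types):
    """Atoms that become derivable only after the seed is added."""
    return forward_closure(rules, seed, types) - forward_closure(rules, set(), types) - set(seed)


X = ("X", None)  # untyped variable: any sort is accepted
MADE_OF_CUP = ("made_of", (("X", "cup"), ("Y", "metal")))
C = (("incombustible", (("X", "cup"),)), [MADE_OF_CUP])
RULES_WITHOUT_C = [
    (("throw_away_on_friday", (X,)), [("incombustible", (X,))]),
    (("throw_away_on_tuesday", (X,)), [("combustible", (X,))]),
    (("incombustible", (("X", "can"),)), [("made_of", (("X", "can"), ("Y", "metal")))]),
    (("has_handle", (("X", "cup"),)), [MADE_OF_CUP]),
]
TYPES = {"a": "cup", "b": "metal"}  # fresh constants typed like the variables they replace
SEED = {("made_of", ("a", "b"))}
```

With the clause `C` removed, `new_atoms(RULES_WITHOUT_C, SEED, TYPES)` contains only the atom for `has_handle`, whereas reasoning with `RULES_WITHOUT_C + [C]` additionally derives the `incombustible` and `throw_away_on_friday` atoms, which restate the property being revised and are therefore unhelpful as candidates.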
Inserting Extended Sort into Sort Hierarchy: After the computation of
the newly introduced sort $s_1'$ (defined as $es$), we have to modify the original
sort hierarchy by inserting $s_1'$ at an adequate position. Such a position is
identified according to the following theorems.

Theorem 3. Let $es$ be an extended sort such that $Root(es) = X : s$. For any
Herbrand model $M$, $E_M(es) \subseteq A_s$.

The theorem says that the new sort should be a subsort of $s_1$ in the resulting
hierarchy.

Theorem 4. Let $es$ be an extended sort such that $Root(es) = X : s$. If there
exists an answer substitution $\theta$ to $es$ with respect to $KB$ such that
$X\theta = Y : t$ (where $Y$ is a new variable), then for any Herbrand model $M$,
$A_t \subseteq E_M(es)$.

Let $\Theta$ be the set of answer substitutions to $es$. Consider the set of sorts
$Sub_{es} = \{s \mid \theta \in \Theta \text{ and } X : s_1 / Y : s \text{ appears in } \theta\}$.
Theorem 4 says that $s_1'$ should have every sort in $Sub_{es}$ as its subsort in the
resulting hierarchy.

Modifying Type Declaration: In addition to the modification of the sort
hierarchy, a modification of the type declaration might be needed.

The modification has to be performed according to the condition that for every
$c \in E_{M_T}(es)$, $c$ belongs to $s_1'$, where $M_T$ is the least Herbrand model of
$T$. The next theorem shows how to modify the original type declaration.

Theorem 5. Let $es$ be an extended sort such that $Root(es) = X : s$. The next
two statements are equivalent:

- $c \in E_{M_T}(es)$.

- There exists an answer substitution $\theta$ to $es$ with respect to $KB$ such that
$X\theta = c$, or $X\theta = Y : t$ (where $Y$ is a new variable) and $c \in A_t$.

In the case where an answer substitution $\theta$ such that $X\theta = Y : t$ is
obtained, no modification of the type declaration is necessary. In this case,
since the original sort hierarchy is modified and $t$ becomes a subsort of $s_1'$
as a result, every $c \in A_t$ obviously belongs to $s_1'$.

In the case where an answer substitution $\theta$ such that $X\theta = c$ is obtained,
let $\Theta$ be the set of answer substitutions to $es$. Consider the set of constants
$Inst_{es} = \{c \mid \theta \in \Theta \text{ and } X : s_1 / c \text{ appears in } \theta\}$. From
Theorem 5, for any $c \in Inst_{es}$, we have to modify the original type declaration
$TD$ into $TD'$ so that $c$ belongs to $s_1'$. Let us assume that for a constant
$c \in Inst_{es}$, $c : t \in TD$. We have to take two cases into account. In the first
case, where $t = s_1$, the original type $s_1$ of $c$ is simply replaced with $s_1'$. In
the second case, where $t \neq s_1$, the type of $c$ is replaced with a sort $s_c$ that
corresponds to a common subsort of $t$ and $s_1'$, since the type of a constant
should be unique. To obtain such an $s_c$, it might be necessary to introduce a
completely new sort, in which case the sort hierarchy is modified accordingly.
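The two cases for redeclaring constants can be summarized in a short sketch. The function name, the `&`-joined naming scheme for a newly introduced common subsort, and the sample constants are all assumptions made for illustration.

```python
def revise_type_declaration(type_decl, inst_es, s1, s1_new):
    """Return (TD', new_sorts) so that every constant in inst_es belongs to s1_new.

    new_sorts maps each freshly introduced sort to the set of its direct supersorts.
    """
    revised = dict(type_decl)
    new_sorts = {}
    for c in inst_es:
        t = type_decl[c]
        if t == s1:
            # First case: the primary type s1 is simply replaced with the new sort.
            revised[c] = s1_new
        else:
            # Second case: a constant keeps a unique type, so it is redeclared with
            # a common subsort of its old type and the new sort (hypothetical name).
            common = f"{t}&{s1_new}"
            new_sorts[common] = {t, s1_new}
            revised[c] = common
    return revised, new_sorts


TD = {"c1": "cup", "c2": "aluminum_cup", "c3": "cup"}
TD_REVISED, NEW_SORTS = revise_type_declaration(TD, {"c1", "c2"}, "cup", "cup_with_handle")
```

In this toy run, `c1` is retyped to the new sort, `c2` receives a freshly named common subsort that must also be inserted into the hierarchy, and `c3` is left untouched because it is not in the instance set.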

Our revision process is summarized as an algorithm in Fig. 2.
Input: A knowledge base $KB = \langle H, TD, T \rangle$.
       $Proof(G)$: the set of clauses that are used to prove a goal $G$ from $KB$.
       A sort $s$ to be specialized.

Output: A revised knowledge base $KB' = \langle H', TD', T' \rangle$.

1. Extract a clause $C = P \leftarrow ES$ from $Proof(G)$ such that a variable typed
   with $s$ appears in its body, and take the variable $X$ typed with $s$ as the
   root variable of $ES$.

2. Derive $es\sigma$ from $(T - \{C\}) \cup \{ES\theta\}$ by forward reasoning with
   SOL-deduction, where $\theta$ is a ground substitution to $ES$ that substitutes no
   constant appearing in $KB$. Then make sure that $es$ satisfies the
   appropriateness conditions C2, C3 and C4. If no such $es$ can be found,
   terminate with failure.

3. Inform the user that a new sort $s'$ to be introduced can be defined as $es$.

4. Revise $T$ to $T'$ by replacing the clause $C$ in $T$ with
   $C' = P_{(X:s, X:s')} \leftarrow ES_{(X:s, X:s')}$.

5. Modify $H$ into $H'$ by inserting the new sort $s'$ at an adequate position.

6. Modify $TD$ into $TD'$ by redeclaring constants with $s'$, or with a completely
   new sort $s''$ if necessary. In the latter case, modify $H'$ following the
   introduction of $s''$.

7. Output the new knowledge base $KB' = \langle H', TD', T' \rangle$ and terminate.



Fig. 2. Algorithm for Knowledge Revision by Type Specialization 



Sort Hierarchy H and Type Declaration TD:

  Subsort relations (plain lines in the original figure):
    aluminum_cup, plastic_cup, ceramic_cup   are subsorts of   cup
    steel_can, aluminum_can                  are subsorts of   can
    metal                                    is a subsort of   material

  Constant type declarations (dotted lines in the original figure):
    aluminum : metal     steel : metal     plastic : material     ceramic : material

Domain Theory T:

  throw_away_on_friday(X) <- incombustible(X).
  throw_away_on_tuesday(X) <- combustible(X).
  incombustible(X : cup) <- made_of(X : cup, Y : metal).
  incombustible(X : can) <- made_of(X : can, Y : metal).
  has_handle(X : cup) <- made_of(X : cup, Y : metal).

  made_of(X : aluminum_cup, aluminum : metal).
  made_of(X : plastic_cup, plastic : material).
  made_of(X : ceramic_cup, ceramic : material).
  made_of(X : steel_can, steel : metal).
  made_of(X : aluminum_can, aluminum : metal).



Fig. 3. Order-Sorted Knowledge Base 



4.3 Example of Similarity-Driven Knowledge Revision 

We illustrate here our revision process for the knowledge base shown in Fig. 3.
For the knowledge base and a goal $G = throw\_away\_on\_friday(X)$, we obtain

$Proof(G) = \{\, throw\_away\_on\_friday(X) \leftarrow incombustible(X),$
$\quad incombustible(X : cup) \leftarrow made\_of(X : cup, Y : metal),$
$\quad made\_of(X : cup, aluminum : metal) \,\}$.

GDA then detects a similarity $\psi$ such that $\psi(cup) = \psi(can)$. Let us assume
that, contrary to this similarity, a user considers that cup and can are not
similar and considers a variable typing with cup to be over-general.

A clause in $Proof(G)$ whose body contains a variable typed with cup is
$C = incombustible(X : cup) \leftarrow made\_of(X : cup, Y : metal)$, where
$made\_of(X : cup, Y : metal)$ is considered as an extended sort $ES$ whose root
variable is $X$. According to Theorem 2, consider a ground instance of $ES$ with a
substitution $\theta = \{X : cup / a,\ Y : metal / b\}$, where $a$ and $b$ are constants
not appearing in the original knowledge base. By forward reasoning from
$(T - \{C\}) \cup \{made\_of(a, b)\}$, an atom $has\_handle(a)$ can be derived.

^ In the figure, the sort hierarchy and type declaration are given in a
hierarchical form. A plain line denotes an "is a subsort of" relation and a
dotted line corresponds to a declaration of constant type. For example, it is
declared that the type of the constant aluminum is the sort metal.

Table 1. Experimental Results

KB Size   R. Type   Num. of Atoms   Exec. Time (msec.)
10        FWR-T     4               SO
10        FWR-Tc    2               50
10        SOL-T     4               10
10        SOL-Tc    2               0
40        FWR-T     6               1130
40        FWR-Tc    2               1020
40        SOL-T     6               40
40        SOL-Tc    2               20

As the next step, $has\_handle(X : cup)$ is tested for its appropriateness.
The answer substitution to the goal $has\_handle(X : cup)$ is
$\{X : cup / Y : aluminum\_cup\}$, where $Y$ is a new variable. Therefore,
$has\_handle(X : cup)$ satisfies C2. It obviously satisfies C3. Furthermore, since
$incombustible(X : cup) \not\preceq_T has\_handle(X : cup)$, C4 is satisfied as well.
Therefore, $has\_handle(X : cup)$ is verified to be an appropriate extended sort.

Let $s$ be a new sort defined by the extended sort $has\_handle(X : cup)$. We next
place the sort $s$ at an adequate position in the original sort hierarchy. It
is easily found that $s$ becomes a subsort of cup. Moreover, since $s$ subsumes
the extended sort $ES = made\_of(X : cup, Y : metal)$, only $aluminum\_cup$ can
become a subsort of $s$.

The original type declaration does not need to be modified in this revision. 

Finally, replace the original clause $C$ with
$C_{(X:cup, X:s)} = incombustible(X : s) \leftarrow made\_of(X : s, Y : metal)$.

The extension of the newly introduced sort $s$ is defined as that of
$has\_handle(X : cup)$. A sort symbol meaning "cup with handle" might be assigned
to $s$.

5 Experimental Results 

We have implemented a knowledge revision system based on our algorithm and
conducted an experiment to verify its usefulness. We show the results here.

As discussed previously, we have four ways to obtain an extended sort to be
introduced: 1) by forward reasoning from $T \cup \{ES\theta\}$, 2) by forward reasoning
from $(T - \{C\}) \cup \{ES\theta\}$, 3) by SOL-deduction from $T \cup \{ES\theta\}$, and 4) by
SOL-deduction from $(T - \{C\}) \cup \{ES\theta\}$. In our experimental results, they
are referred to as reasoning types "FWR-T", "FWR-Tc", "SOL-T" and "SOL-Tc",
respectively. We provided two knowledge bases consisting of 10 clauses and 40
clauses. For each knowledge base and reasoning type, the number of atoms defining
an extended sort and the execution time (msec.) were examined. Our experimental
results are shown in Table 1.

The results show that the removal of $C$ affects the quality of the computed
extended sorts. For both knowledge bases, some unnecessary atoms were derived
by the reasoning from $T$. Moreover, such meaningless derivations undesirably
increase the execution times. We therefore consider the removal of $C$ to be
effective in improving both the efficiency and the quality of our knowledge
revision.

^ Our system has been written in SICStus Prolog and run on a SPARCstation 5.

The experimental results also show that the use of SOL-deduction can
drastically reduce the execution time. The reduction ratio tends to become
larger as the knowledge base grows. This suggests that SOL-deduction would help
improve the efficiency of our revision even in a real domain. We are currently
considering a legal domain as an attractive application field for our method.

6 Concluding Remarks 

In this paper, we presented a novel framework of Similarity-Driven Knowledge
Revision. Our revision process is invoked by the observation of an undesirable
similarity reflected in the original knowledge base. The knowledge base is
revised by specializing an over-general variable typing $X : s$ into a more
specific $X : s'$. The central task is to find an extended sort $es$ by which $s'$
is precisely defined and is an imaginable subsort of $s$. For the efficient
computation of the extended sort, we presented an effective proof-theoretic
property and proposed the use of SOL-deduction. A drastic improvement in
efficiency was empirically verified.

An important future work is to investigate this kind of knowledge revision in a 
case where several over-general variable typings are found. Further investigation 
would be needed to obtain a good characterization of this complicated revision. 

Contrary to the assumption on similarity observation in this paper, it would 
also be worth studying a revision method in a case where a dissimilarity between 
two sorts is detected by GDA, while the user considers they are similar. In this 
case, it might be needed to add some new clauses to the original knowledge base 
as well as modifying variable typings. 
Abstract. We introduce a new model for learning in the presence of
noise, which we call the Nasty Noise model. This model generalizes
previously considered models of learning with noise. The learning process
in this model, which is a variant of the PAC model, proceeds as follows:
Suppose that the learning algorithm during its execution asks for $m$
examples. The examples that the algorithm gets are generated by a nasty
adversary that works according to the following steps. First, the
adversary chooses $m$ examples (independently) according to the fixed (but
unknown to the learning algorithm) distribution $D$ as in the PAC model.
Then the powerful adversary, upon seeing the specific $m$ examples that
were chosen (and using its knowledge of the target function, the
distribution $D$ and the learning algorithm), is allowed to remove a fraction
of the examples of its choice, and replace these examples by the same
number of arbitrary examples of its choice; the $m$ modified examples are
then given to the learning algorithm. The only restriction on the
adversary is that the number of examples that the adversary is allowed to
modify should be distributed according to a binomial distribution with
parameters $\eta$ (the noise rate) and $m$.

On the negative side, we prove that no algorithm can achieve accuracy
of $\epsilon < 2\eta$ in learning any non-trivial class of functions. On the positive
side, we show that a polynomial (in the usual parameters, and in
$\epsilon - 2\eta$) number of examples suffices for learning any class of finite
VC-dimension with accuracy $\epsilon > 2\eta$. This algorithm may not be efficient;
however, we also show that a fairly wide family of concept classes can be
efficiently learned in the presence of nasty noise.



1 Introduction 

Valiant's PAC model of learning [23] is one of the most important models for
learning from examples. Although it is an extremely elegant model, the PAC
model has some drawbacks. In particular, it assumes that the learning
algorithm has access to a perfect source of random examples. Namely, upon
request, the learning algorithm can ask for random examples and in return gets
pairs $(x, c_t(x))$, where all the $x$'s are points in the input space distributed
identically and independently according to some fixed probability distribution
$D$, and $c_t(x)$ is the correct classification of $x$ according to the target
function $c_t$ that the algorithm tries to learn.

Since Valiant's seminal work, there have been several attempts to relax these
assumptions by introducing models of noise. The first such noise model, called
the Random Classification Noise model, was introduced in [2] and has been
extensively studied. In this model the adversary, before providing each example
$(x, c_t(x))$ to the learning algorithm, tosses a biased coin; whenever the coin
shows "H", which happens with probability $\eta$, the classification of the
example is flipped and so the algorithm is provided with the wrongly classified
example $(x, 1 - c_t(x))$. Another (stronger) model, called the Malicious Noise
model, was introduced in [23] and has been revisited and further studied since.
In this model the adversary, whenever the $\eta$-biased coin shows "H", can
replace the example $(x, c_t(x))$ by some arbitrary pair $(x', b)$, where $x'$ is any
point in the input space and $b$ is a boolean value. (Note that this in
particular gives the adversary the power to "distort" the distribution $D$.)

* Some of this research was done while this author was at the Department of Computer
Science, the University of Calgary, Canada.

O. Watanabe, T. Yokomori (Eds.): ALT’99, LNAI 1720, pp. 206-|2]^ 1999.
(c) Springer-Verlag Berlin Heidelberg 1999

In this work, we present a new model which we call the Nasty (Sample) Noise
model. In this model, the adversary gets to see the whole sample of examples
requested by the learning algorithm before giving it to the algorithm and then
modifies $E$ of the examples, at its choice, where $E$ is a random variable
distributed according to the binomial distribution with parameters $\eta$ and
$m$, and $m$ is the size of the sample. The modification applied by the adversary
can be arbitrary (as in the Malicious Noise model). Intuitively speaking, the new
adversary is more powerful than the previous ones: it can examine the whole
sample and then remove from it the most "informative" examples and replace them
by less useful and even misleading examples (whereas in the Malicious Noise
model, for instance, the adversary may also insert misleading examples into the
sample but does not have the freedom to choose which examples to remove). The
relationships between the various models are shown in Table 1.





                        Random Noise-Location          Adversarial Noise-Location
Label Noise Only        Random Classification Noise    Nasty Classification Noise
Point and Label Noise   Malicious Noise                Nasty Sample Noise



Table 1. Summary of models for PAC-learning from noisy data 
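The sampling process of the Nasty Sample Noise model can be simulated with a short sketch. The interface is an assumption made for illustration: the adversary is a function that sees the whole clean sample and a budget and returns a dictionary of replacements, and the example adversary, which only flips labels near a threshold, is a hypothetical instance of the weaker Nasty Classification Noise variant.

```python
import random


def nasty_sample(draw_point, target, m, eta, adversary, rng):
    """Draw m clean examples, then let the adversary replace at most E of them."""
    clean = [(x, target(x)) for x in (draw_point(rng) for _ in range(m))]
    # E ~ Binomial(m, eta), realized as m independent eta-biased coin tosses.
    budget = sum(rng.random() < eta for _ in range(m))
    # The adversary sees the whole sample before choosing what to replace.
    replacements = adversary(clean, budget)
    if len(replacements) > budget:
        raise ValueError("adversary modified more examples than allowed")
    return [replacements.get(i, example) for i, example in enumerate(clean)]


def boundary_adversary(sample, budget, threshold=0.5):
    """Hypothetical adversary: flip the labels of the points closest to the threshold."""
    by_distance = sorted(range(len(sample)), key=lambda i: abs(sample[i][0] - threshold))
    return {i: (sample[i][0], not sample[i][1]) for i in by_distance[:budget]}


def threshold_target(x):
    return x >= 0.5


def uniform_point(rng):
    return rng.random()
```

A call such as `nasty_sample(uniform_point, threshold_target, 200, 0.1, boundary_adversary, random.Random(0))` returns 200 examples in which the corrupted labels are concentrated around the decision boundary, the kind of targeted corruption that a coin-toss noise model cannot express.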

We argue that the newly introduced model not only generalizes the previous
noise models, including variants such as Decatur's CAM model and the CPCN
model, but also that in many real-world situations the assumptions previous
models made about the noise seem insufficient. For example, when training
data is the result of some physical experiment, noise may tend to be stronger in
boundary areas rather than being uniformly distributed over all inputs. While
special models were devised to describe this situation in the exact-learning
setting (for example, the incomplete boundary query model of Blum et al.), it
may be regarded as a special case of Nasty Noise, where the adversary chooses to
provide unreliable answers on sample points that are near the boundary of the
target concept (or to remove such points from the sample). Another situation to
which our model is related is the setting of Agnostic Learning. In this model, a
concept class is not given. Instead, the learning algorithm needs to minimize the
empirical error while using a hypothesis from a predefined hypothesis class.
Assuming the best hypothesis classifies the input up to an $\eta$ fraction, we may
alternatively see the problem as that of learning the hypothesis class under
nasty noise of rate $\eta$. However, we note that the success criterion in the
agnostic learning literature is different from the one used in our PAC-based
setting.

^ This distribution makes the number of examples modified be the same as if it were
determined by $m$ independent tosses of an $\eta$-biased coin. However, we allow the
adversary's choice to be dependent on the sample drawn.

^ We also consider a weaker variant of this model, called the Nasty Classification Noise
model, where the adversary may modify only the classification of the chosen points
(as in the Random Classification Noise model).

We show two types of results. Sections 3 and 4 show information-theoretic
results, and Sect. 5 shows algorithmic results. The first result, presented in
Sect. 3, is a bound on the quality of learning possible with a nasty adversary.
This result shows that no learning algorithm can learn any non-trivial concept
class with accuracy less than $2\eta$ when the sample contains nasty noise of rate
$\eta$. It is complemented by a matching positive result in Sect. 4 that shows that
any class of finite VC-dimension can be learned by using a sample of polynomial
size, with any accuracy $\epsilon > 2\eta$. The size of the sample required is polynomial
in the usual PAC parameters and in $1/\Delta$, where $\Delta = \epsilon - 2\eta$ is the margin
between the requested accuracy $\epsilon$ and the above-mentioned lower bound.

The main, quite surprising, result (presented in Sect. 5) is another positive
result showing that efficient learning algorithms are still possible in spite of 
the powerful adversary. More specifically, we present a composition theorem 
(analogous to I3I8I but for the nasty-noise learning model) that shows that any 
concept class that is constructed by composing concept classes that are PAC- 
learnable from a hypothesis class of fixed VC-dimension, is efficiently learnable 
when using a sample subject to nasty noise. This includes, for instance, the class 
of all concepts formed by any boolean combination of half-spaces in a constant 
dimension Euclidean space. The complexity here is, again, polynomial in the 
usual parameters and in $1/\Delta$. The algorithm used in the proof of this result is
an adaptation to our model of the PAC algorithm presented in 0 . 

Our results may be compared to similar results available for the Malicious 
Noise model. For this model, Cesa-Bianchi et al. PH show that the accuracy 
of learning with malicious noise is lower bounded by $\eta/(1-\eta)$. A matching
algorithm for learning classes similar to those presented here with malicious 
noise is presented in jS|. As for the Random Classification Noise model, learning 
with arbitrarily small accuracy, even when the noise rate is close to a half, is
possible. Again, the techniques presented in jHj may be used to learn the same 
type of classes we examine in this work with Random Classification Noise. 

2 Preliminaries 

In this section we provide basic definitions related to learning in the PAC model, 
with and without noise. A learning task is specified using a concept class, denoted 
C, of boolean concepts defined over an instance space, denoted X. A boolean 
concept $c$ is a function $c : X \to \{0,1\}$. The concept class $C$ is a set of boolean
concepts: $C \subseteq \{0,1\}^X$.

Throughout this paper we sometimes treat a concept as a set of points instead 
of as a boolean function. The set that corresponds to a concept c is simply 
$\{x \mid c(x) = 1\}$. We use $c$ to denote both the function and the corresponding set
interchangeably. Specifically, when a probability distribution $D$ is defined over
$X$, we use the notation $D(c)$ to refer to the probability that a point $x$ drawn
from $X$ according to $D$ will have $c(x) = 1$.
2.1 The Classical PAC Model

The Probably Approximately Correct (PAC) model was originally presented by 
Valiant m- In this model, the learning algorithm has access to an oracle PAC 
that returns on each call a labeled example $(x, c_t(x))$, where $x \in X$ is drawn
(independently) according to a fixed distribution $D$ over $X$, unknown to the
learning algorithm, and $c_t \in C$ is the target function the learning algorithm
should “learn”. 

Definition 1. A class $C$ of boolean functions is PAC-learnable using hypothesis
class $\mathcal{H}$ in polynomial time if there exists an algorithm that, for any $c_t \in C$, any
$0 < \epsilon < 1/2$, $0 < \delta < 1$ and any distribution $D$ on $X$, when given access to the
PAC oracle, runs in time polynomial in $\log |X|$, $1/\delta$, $1/\epsilon$ and with probability at
least $1 - \delta$ outputs a function $h \in \mathcal{H}$ for which $\Pr_D[c_t(x) \ne h(x)] \le \epsilon$.



2.2 Models for Learning in the Presence of Noise 

Next, we define the model of PAC-learning in the presence of Nasty Sample 
Noise (NSN for short). In this model, a learning algorithm for the concept class 
$C$ is given access to an (adversarial) oracle $\mathrm{NSN}_{C,\eta}(m)$. The learning algorithm
is allowed to call this oracle once during a single run. The learning algorithm
passes a single natural number $m$ to the oracle, specifying the size of the sample
it needs, and gets in return a labeled sample $S \in (X \times \{0,1\})^m$. (It is assumed,
for simplicity, that the algorithm knows in advance the number of examples it
needs; it is possible to extend the model to circumvent this problem.)

The sample required by the learning algorithm is constructed as follows: As 
in the PAC model, a distribution D over the instance space X is defined, and 
a target concept $c_t \in C$ is chosen. The adversary then draws a sample $S_g$ of
$m$ points from $X$ according to the distribution $D$. Having full knowledge of the
learning algorithm, the target function $c_t$, the distribution $D$, and the sample
drawn, the adversary chooses $E = E(S_g)$ points from the sample, where $E(S_g)$ is
a random variable. The $E$ points chosen by the adversary are removed from the
sample and replaced by any other $E$ point-and-label pairs by the adversary. The
$m - E$ points not chosen by the adversary remain unchanged and are labeled by
their correct labels according to $c_t$. The modified sample of $m$ points, denoted $S$,
is then given to the learning algorithm. The only limitation that the adversary
has on the number of examples that it may modify is that it should be distributed
according to the binomial distribution with parameters $m$ and $\eta$, namely:

$$\Pr[E = n] = \binom{m}{n} \eta^{n} (1 - \eta)^{m-n}$$

where the probability is taken by first choosing $S_g \in D^m$ and then choosing $E$
according to the corresponding random variable $E(S_g)$.
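
The sample-generation process just described can be sketched in code. The
following is only an illustration under simplifying assumptions: the function
names, the one-dimensional instance space, and the particular "boundary"
adversary are hypothetical and not part of the model, and the number of modified
points is drawn independently of the sample (the model also allows it to depend
on the sample, as long as its distribution stays binomial).

```python
import random

def nasty_sample_oracle(m, eta, draw_point, target, adversary, rng):
    """Sketch of one call to the NSN oracle (all names are illustrative)."""
    # Clean sample S_g: m points drawn i.i.d. from D and labeled by the target.
    clean = [(x, target(x)) for x in (draw_point(rng) for _ in range(m))]
    # E is the number of heads in m tosses of an eta-biased coin,
    # i.e. E ~ Binomial(m, eta).
    num_modified = sum(rng.random() < eta for _ in range(m))
    # The adversary sees the whole clean sample and returns
    # (index, replacement_example) pairs for at most E distinct indices.
    replacements = adversary(clean, num_modified)
    assert len(replacements) <= num_modified
    noisy = list(clean)
    for index, new_example in replacements:
        noisy[index] = new_example
    return clean, noisy

def boundary_adversary(clean, budget, threshold=0.5):
    """Hypothetical adversary: flips the labels of the points nearest the boundary."""
    order = sorted(range(len(clean)), key=lambda i: abs(clean[i][0] - threshold))
    return [(i, (clean[i][0], 1 - clean[i][1])) for i in order[:budget]]

rng = random.Random(0)
clean, noisy = nasty_sample_oracle(
    m=200, eta=0.1,
    draw_point=lambda r: r.random(),      # D = uniform on [0, 1)
    target=lambda x: int(x >= 0.5),       # target concept: a threshold
    adversary=boundary_adversary, rng=rng)
changed = sum(a != b for a, b in zip(clean, noisy))
```

This particular adversary only changes labels, so it is also a legal adversary
for the weaker classification-noise variant of the model.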

Definition 2. An algorithm $A$ is said to learn a class $C$ with nasty sample noise
of rate $\eta > 0$ with accuracy parameter $\epsilon > 0$ and confidence parameter $\delta < 1$
if, given access to any oracle $\mathrm{NSN}_{C,\eta}(m)$, for any distribution $D$ and any target
$c_t \in C$ it outputs a hypothesis $h : X \to \{0,1\}$ such that, with probability at least
$1 - \delta$, the hypothesis satisfies $\Pr_D[h \triangle c_t] \le \epsilon$.
We are also interested in a restriction of this model, which we call the Nasty
Classification Noise learning model (NCN for short). The only difference be- 
tween the NCN and NSN models is that the NCN adversary is only allowed 
to modify the labels of the E chosen sample-points, but it cannot modify the 
E points themselves. Previous models of learning in the presence of noise can 
also be readily shown to be restrictions of the Nasty Sample Noise model: The 
Malicious Noise model corresponds to the Nasty Noise model with the adversary 
restricted to introducing noise into points that are chosen uniformly at random, 
with probability $\eta$, from the original sample. The Random Classification Noise
model corresponds to the Nasty Classification Noise model with the adversary 
restricted so that noise is introduced into points chosen uniformly at random, 
with probability $\eta$, from the original sample, and each point that is chosen gets
its label flipped. 



2.3 VC Theory Basics 

The VC-dimension EH], is widely used in learning theory to measure the com- 
plexity of concept classes. The VC-dimension of a class $C$, denoted $\mathrm{VCdim}(C)$,
is the maximal integer $d$ such that there exists a subset $Y \subseteq X$ of size $d$ for
which all $2^d$ possible behaviors are present in the class $C$, and $\mathrm{VCdim}(C) = \infty$
if such a subset exists for any natural $d$. It is well known (e.g., 0) that, for
any two classes $C$ and $\mathcal{H}$ (over $X$) of VC-dimension $d$, the class of negations
$\{c \mid X \setminus c \in C\}$ has VC-dimension $d$, and the class of unions
$\{c \cup h \mid c \in C, h \in \mathcal{H}\}$ has VC-dimension at most
$2\max\{\mathrm{VCdim}(C), \mathrm{VCdim}(\mathcal{H})\} + 1 = O(d)$. Following
0 we define the dual of a concept class:

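Definition 3. The dual $\mathcal{H}^\perp \subseteq \{0,1\}^{\mathcal{H}}$ of a class $\mathcal{H} \subseteq \{0,1\}^X$ is defined to be
the set $\{x^\perp \mid x \in X\}$ where $x^\perp$ is defined by $x^\perp(h) = h(x)$ for all $h \in \mathcal{H}$.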

If we view a concept class $\mathcal{H}$ as a boolean matrix $M_{\mathcal{H}}$ where each row represents
a concept and each column a point from the instance space $X$, then the matrix
corresponding to $\mathcal{H}^\perp$ is the transpose of the matrix $M_{\mathcal{H}}$. The following claim,
from 0, gives a tight bound on the VC dimension of the dual class:

Claim 1: For every class $\mathcal{H}$, $\mathrm{VCdim}(\mathcal{H}) \ge \lfloor \log \mathrm{VCdim}(\mathcal{H}^\perp) \rfloor$.
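
For small finite classes, the VC-dimension and the dual class can be computed
directly from the boolean-matrix view. The sketch below is a brute-force
illustration only (exponential time, with hypothetical function names), using
the class of thresholds over four points as an assumed example.

```python
from itertools import combinations

def vc_dimension(concepts, domain_size):
    """Brute-force VC-dimension of a finite class.

    `concepts` is a collection of 0/1 tuples of length `domain_size`,
    i.e. the rows of the boolean matrix of the class.
    """
    rows = set(map(tuple, concepts))
    best = 0
    for d in range(1, domain_size + 1):
        # A set of d columns is shattered if all 2^d patterns appear on it.
        shattered = any(
            len({tuple(row[i] for i in subset) for row in rows}) == 2 ** d
            for subset in combinations(range(domain_size), d)
        )
        if not shattered:
            break  # if no d-set is shattered, no larger set is either
        best = d
    return best

def dual_class(concepts):
    """Transpose the matrix: every point becomes a 0/1 function on the concepts."""
    rows = [tuple(row) for row in concepts]
    return list(zip(*rows))

# Thresholds h_j(x) = 1 iff x >= j over the points 0..3 (an assumed toy class).
thresholds = [tuple(int(x >= j) for x in range(4)) for j in range(5)]
vc_primal = vc_dimension(thresholds, 4)
vc_dual = vc_dimension(dual_class(thresholds), len(thresholds))
```

For this toy class both values are 1, which is consistent with the claim above.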

In the following discussion we limit ourselves to instance spaces X of finite 
cardinality. The main use we make of the VC-dimension is in constructing
$\alpha$-nets. The following definition and theorem are from 0:

Definition 4. A set of points $Y \subseteq X$ is an $\alpha$-net for concept class $\mathcal{H} \subseteq \{0,1\}^X$
under distribution $D$ over $X$, if for every $h \in \mathcal{H}$ such that $D(h) \ge \alpha$, $Y \cap h \ne \emptyset$.



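Theorem 1. For any class $\mathcal{H} \subseteq \{0,1\}^X$ of VC-dimension $d$, any distribution
$D$ over $X$, and any $\alpha > 0$, $\delta > 0$, if
$m \ge \max\left\{ \frac{4}{\alpha} \log \frac{2}{\delta},\ \frac{8d}{\alpha} \log \frac{13}{\alpha} \right\}$ examples are
drawn i.i.d. from $X$ according to the distribution $D$, they constitute an $\alpha$-net for
$\mathcal{H}$ with probability at least $1 - \delta$.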
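
To get a feeling for the sample sizes involved, the bound of Theorem 1 can be
evaluated numerically. The sketch below assumes the constants of the classical
$\alpha$-net bound, $\frac{4}{\alpha}\log\frac{2}{\delta}$ and
$\frac{8d}{\alpha}\log\frac{13}{\alpha}$, with base-2 logarithms; the function
name is hypothetical.

```python
import math

def alpha_net_sample_size(vc_dim, alpha, delta):
    """Smallest integer m meeting the (assumed) alpha-net bound."""
    first = (4 / alpha) * math.log2(2 / delta)
    second = (8 * vc_dim / alpha) * math.log2(13 / alpha)
    return math.ceil(max(first, second))

# The bound grows linearly in the VC-dimension and roughly as
# (1/alpha) * log(1/alpha) in the net parameter.
sizes = [alpha_net_sample_size(d, 0.5, 0.5) for d in (1, 2, 4)]
```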



In j^, Talagrand proved a similar result: 
Definition 5. A set of points $Y \subseteq X$ is an $\alpha$-sample for the concept class
$\mathcal{H} \subseteq \{0,1\}^X$ under the distribution $D$ over $X$, if it holds that every $h \in \mathcal{H}$
satisfies

$$\left| D(h) - \frac{|Y \cap h|}{|Y|} \right| \le \alpha.$$

Theorem 2. There is a constant $c_1$ such that for any class $\mathcal{H} \subseteq \{0,1\}^X$ of
VC-dimension $d$, any distribution $D$ over $X$, and any $\alpha > 0$, $\delta > 0$, if
$m \ge \frac{c_1}{\alpha^2} \left( d + \log \frac{1}{\delta} \right)$ examples are drawn i.i.d. from $X$ according to the
distribution $D$, they constitute an $\alpha$-sample for $\mathcal{H}$ with probability at least $1 - \delta$.



2.4 Consistency Algorithms 

Let $P$ and $N$ be subsets of points from $X$. We say that a function $h : X \to \{0,1\}$
is consistent on $(P, N)$ if $h(x) = 1$ for every “positive point” $x \in P$ and $h(x) = 0$
for every “negative point” $x \in N$. A consistency algorithm (see jS|) for a pair
of classes $(C, \mathcal{H})$ (both over the same instance space $X$) receives as input two
subsets of the instance space, $(P, N)$, runs in time $t(|P \cup N|)$, and satisfies the
following. If there is a function in $C$ that is consistent with $(P, N)$, the algorithm
outputs “YES” and some $h \in \mathcal{H}$ that is consistent with $(P, N)$; it outputs “NO” if no
consistent $h \in \mathcal{H}$ exists (there is no restriction on the output in the case that
there is a consistent function in $\mathcal{H}$ but not in $C$).

Given a subset of points of the instance space $Q \subseteq X$, we will be inter-
ested in the set of all possible partitions of $Q$ into positive and negative ex-
amples, such that there is a function $h \in \mathcal{H}$ and a function $c \in C$ that are
both consistent with this partition. This may be formulated as:
$S_{\mathrm{con}}(Q) = \{P \mid \mathrm{CON}(P, Q \setminus P) = \text{“YES”}\}$, where CON is a consistency algorithm for $(C, \mathcal{H})$.
Bshouty 0 shows the following, based on the Sauer Lemma m- 

Lemma 1. For any set of points Q, \Scon(Q)\ < |q| 

Furthermore, an efficient algorithm for generating this set of partitions (along
with the corresponding functions $h \in \mathcal{H}$) is presented, assuming that $C$ is PAC-
learnable from $\mathcal{H}$ of constant VC dimension. The algorithm’s output is denoted

$S_{\mathrm{CON}}(Q) = \{((P, Q \setminus P), h) \mid P \in S_{\mathrm{con}}(Q) \text{ and } h \text{ is consistent with } (P, Q \setminus P)\}$.

3 Information Theoretic Lower Bound 

In this section we show that no learning algorithm (not even inefficient ones) 
can learn a “non-trivial” concept class with accuracy $\epsilon$ better than $2\eta$ under the
NSN model; in fact, we prove that this impossibility result holds even for the 
NCN model. 

Definition 6. A concept class $C$ over an instance space $X$ is called non-trivial
if there exist two points $x_1, x_2 \in X$ and two concepts $c_1, c_2 \in C$, such that
$c_1(x_1) = c_2(x_1)$ and $c_1(x_2) \ne c_2(x_2)$.

Theorem 3. Let $C$ be a non-trivial concept class, $\eta$ be a noise rate and $\epsilon < 2\eta$
be an accuracy parameter. Then, there is no algorithm that learns the concept
class $C$ with accuracy $\epsilon$ under the NCN model (with rate $\eta$).
Proof sketch: We base our proof on the method of induced distributions
introduced in UHl Theorem 1 ] . We show that there are two concepts $c_1, c_2 \in C$
and a probability distribution $D$ such that $\Pr_D(c_1 \triangle c_2) = 2\eta$ and an adversary
can force the labeled examples shown to the learning algorithm to be distributed
identically both when $c_1$ is the target and when $c_2$ is the target.

Let $c_1$ and $c_2$ be the two concepts whose existence is guaranteed by the fact
that $C$ is a non-trivial class, and let $x_1, x_2 \in X$ be the two points that satisfy
$c_1(x_1) = c_2(x_1)$ and $c_1(x_2) \ne c_2(x_2)$. We define the probability distribution $D$ to
be $D(x_1) = 1 - 2\eta$, $D(x_2) = 2\eta$, and $D(x) = 0$ for all $x \in X \setminus \{x_1, x_2\}$. Clearly,
we indeed have $\Pr_D(c_1 \triangle c_2) = \Pr_D(x_2) = 2\eta$.

The adversary will modify exactly half of the sample points of the form $(x_2, \ell)$
to $(x_2, 1 - \ell)$. This results in the learning algorithm being given a sample
effectively drawn from the following induced distribution:

$$\Pr(x_1, c_1(x_1)) = 1 - 2\eta \quad \text{and} \quad \Pr(x_2, c_1(x_2)) = \Pr(x_2, c_2(x_2)) = \eta.$$

This induced distribution would be the same no matter whether the true target
is $c_1$ or $c_2$. Therefore, according to the sample that the learning algorithm sees,
it is impossible to differentiate between the case where the target function is $c_1$
and the case where the target function is $c_2$. □
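
The indistinguishability argument can be checked mechanically on a toy
instance. The sketch below is only an illustration: the point names, the
dictionary encoding of concepts, and the function name are assumptions, and
exact rational arithmetic is used so that the two induced distributions can be
compared for equality.

```python
from fractions import Fraction

def induced_distribution(target, other, eta):
    """Distribution over labeled examples the learner sees in the proof sketch.

    `target` and `other` map the points "x1", "x2" to labels; they must agree
    on x1 and disagree on x2 (the non-triviality condition).
    """
    assert target["x1"] == other["x1"] and target["x2"] != other["x2"]
    weight = {"x1": 1 - 2 * eta, "x2": 2 * eta}      # the distribution D
    seen = {}
    # Examples on x1 are never touched by the adversary.
    seen[("x1", target["x1"])] = weight["x1"]
    # Half of the examples on x2 keep their label, half get it flipped,
    # so the corrupted mass is exactly eta.
    seen[("x2", target["x2"])] = weight["x2"] / 2
    seen[("x2", 1 - target["x2"])] = weight["x2"] / 2
    return seen

eta = Fraction(1, 10)
c1 = {"x1": 0, "x2": 0}
c2 = {"x1": 0, "x2": 1}
view_if_c1 = induced_distribution(c1, c2, eta)
view_if_c2 = induced_distribution(c2, c1, eta)
```

Both views are the same dictionary, so no function of the sample can tell the
two targets apart, even though they differ on a set of weight $2\eta$.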

Note that in the above proof we indeed take advantage of the “nastiness” 
of the adversary. Unlike the malicious adversary, our adversary can focus all its 
“power” on just the point $x_2$, causing it to suffer a relatively high error rate,
while examples in which the point is $x_1$ do not suffer any noise. Finally, since
any NCN adversary is also an NSN adversary, Theorem 3 implies the following:

Corollary 1. Let $C$ be a non-trivial concept class, $\eta > 0$ be the noise rate, and
$\epsilon < 2\eta$ be an accuracy parameter. There is no algorithm that learns the concept
class $C$ with accuracy $\epsilon$ under the NSN model, with noise rate $\eta$.

4 Information Theoretic Upper Bound 

In this section we provide a positive result that complements the negative result
of Sect. 3. This result shows that, given a sufficiently large sample, any hypothesis
that performs sufficiently well on the sample (even when this sample is subject
to nasty noise) satisfies the PAC learning condition. Formally, we analyze the
following generic algorithm for learning any class $C$ of VC-dimension $d$, whose
inputs are a certainty parameter $\delta > 0$, the nasty error rate parameter $\eta < \frac{1}{2}$,
and the required accuracy $\epsilon = 2\eta + \Delta$:

Algorithm NastyConsistent:

1. Request a sample $S = \{(x, b_x)\}$ of size $m \ge \frac{c}{\Delta^2} \left( d + \log \frac{2}{\delta} \right)$.

2. Output any $h \in C$ such that $|\{x \in S : h(x) \ne b_x\}| \le m(\eta + \Delta/4)$ (if no such
$h$ exists, choose any $h \in C$ arbitrarily).
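
Step 2 of the algorithm can be sketched for the special case where the class is
given as an explicit finite list of hypotheses. This is an illustrative
assumption only (the analysis covers any class of finite VC-dimension and makes
no claim about efficiency), and the function and parameter names are
hypothetical.

```python
def nasty_consistent(sample, hypotheses, eta, delta_margin):
    """Return a hypothesis with at most m*(eta + Delta/4) disagreements.

    `sample` is the (possibly corrupted) list of (x, label) pairs,
    `hypotheses` a finite list of callables standing in for the class C,
    and `delta_margin` plays the role of Delta = epsilon - 2*eta.
    """
    m = len(sample)
    budget = m * (eta + delta_margin / 4)
    for h in hypotheses:
        errors = sum(1 for x, label in sample if h(x) != label)
        if errors <= budget:
            return h
    return hypotheses[0]  # no hypothesis qualifies: output an arbitrary one

# Toy usage: thresholds over the integers 0..99, five corrupted labels.
def make_threshold(t):
    return lambda x: int(x >= t)

hypotheses = [make_threshold(t) for t in range(0, 101, 10)]
sample = [(x, int(x >= 50)) for x in range(100)]
for i in range(5):
    sample[i] = (i, 1)            # labels flipped by the adversary
chosen = nasty_consistent(sample, hypotheses, eta=0.05, delta_margin=0.1)
```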



Theorem 4. Let C be any class of VC-dimension d. Then, (for some constant 
c) algorithm NastyConsistent is a PAC learning algorithm under nasty sample 
noise of rate $\eta$.
Proof. By Hoeffding’s inequality uni, with probability $1 - \delta/2$ the number $E$ of
sample points that are modified by the adversary is at most $m(\eta + \Delta/4)$.

Now, we note that the target function $c_t$ errs on at most $E$ points of the
sample shown to the learning algorithm (as it is completely accurate on the non-
modified sample $S_g$). Thus, with high probability Algorithm NastyConsistent
will be able to choose a function $h \in C$ that errs on no more than $(\eta + \Delta/4)m$
points of the sample shown to it. However, in the worst case, these errors of the
function $h$ occur in points that were not modified by the adversary. In addition,
$h$ may be erroneous for all the points that the adversary did modify. Therefore,
all we are guaranteed in this case is that the hypothesis $h$ errs on no more
than $2E$ points of the original sample $S_g$. By Theorem 2 there exists a constant
$c$ such that, with probability $1 - \delta/2$, the sample $S_g$ is a $\Delta/2$-sample for the
class of symmetric differences between functions from $C$. By the union bound we
therefore have that, with probability at least $1 - \delta$, $E \le (\eta + \Delta/4)m$, meaning
that $|S_g \cap (c_t \triangle h)| \le (2\eta + \Delta/2)m$, and that $S_g$ is a $\Delta/2$-sample for the class of
symmetric differences, and so $\Pr_D[c_t \triangle h] \le 2\eta + \Delta = \epsilon$, as required. □
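
The counting in the proof can be written as a single chain of inequalities.
This is only a restatement of the argument above, valid on the event that both
the bound on $E$ and the $\Delta/2$-sample property hold:

```latex
\begin{align*}
|S_g \cap (c_t \triangle h)|
  &\le \underbrace{\left(\eta + \tfrac{\Delta}{4}\right) m}_{\text{errors of } h \text{ on unmodified points}}
     + \underbrace{E}_{\text{modified points}}
   \le \left(2\eta + \tfrac{\Delta}{2}\right) m, \\
\Pr_D[c_t \triangle h]
  &\le \frac{|S_g \cap (c_t \triangle h)|}{m} + \frac{\Delta}{2}
   \le 2\eta + \Delta = \epsilon .
\end{align*}
```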



5 Composition Theorem for Learning with Nasty Noise 

Following 13 and j3, we define the notion of “composition class”: Let C be a 
class of boolean functions $g : X \to \{0,1\}$. Define the class $C^*$ to be the set of
all boolean functions $F(x)$ that can be represented as $f(g_1(x), \ldots, g_k(x))$ where
$f$ is any boolean function, and $g_i \in C$ for $i = 1, \ldots, k$. We define the size of
$f(g_1, \ldots, g_k)$ to be $k$. Given a vector of hypotheses $(h_1, \ldots, h_t) \in \mathcal{H}^t$, define the
set $\mathcal{W}(h_1, \ldots, h_t)$ to be the set of sub-domains $W_a = \{x \mid (h_1(x), \ldots, h_t(x)) = a\}$
for all possible vectors $a \in \{0,1\}^t$.

We now show a variation of the algorithm presented in |3 that can learn 
the class $C^*$ with a nasty sample adversary, assuming that the class $C$ is PAC-
learnable from a class $\mathcal{H}$ of constant VC dimension $d$. The algorithm builds
on the fact that a consistency algorithm CON for $(C, \mathcal{H})$ can be constructed,
given an algorithm that PAC learns $C$ from $\mathcal{H}$ 0. This algorithm can learn the
concept class $C^*$ with any confidence parameter $\delta$ and with accuracy $\epsilon$ that is
arbitrarily close to the lower bound of $2\eta$, proved in the previous section. Its
sample complexity and computational complexity are both polynomial in $k$, $1/\delta$
and $1/\Delta$, where $\Delta = \epsilon - 2\eta$.

The algorithm is based on the following idea: Request a large sample from 
the oracle and randomly pick a smaller sub-sample from the sample retrieved. 
The random choice of a sub-sample neutralizes some of the power the adversary 
has, since the adversary cannot know which examples are the ones that will 
be most “informative” for us. Then use the consistency algorithm for $(C, \mathcal{H})$ to
find one representative from $\mathcal{H}$ for any possible behavior on the smaller sub-
sample. These hypotheses from $\mathcal{H}$ now define a division of the instance space
into “cells”, where each cell is characterized by a specific behavior of all the
hypotheses picked. The final hypothesis is simply based on taking a majority
vote among the complete sample inside each such cell.

To demonstrate the algorithm, we consider (informally) the specific, relatively 
simple, case where the class to be learned is the class of k intervals on the line 
(see Fig. 1). The algorithm, given a sample as input, proceeds as follows:
1. The algorithm uses a “small”, random sub-sample to divide the line into
sub-intervals. Each two adjacent points in the sub-sample define such a sub- 
interval. 

2. For each such sub-interval the algorithm calculates a majority vote on the 
complete sample. The result is our hypothesis. 

The number of points (which in this specific case is the number of sub-intervals) 
the algorithm chooses in the first step depends on k. Intuitively, we want the total 
weight of the sub-intervals containing the target’s end-points to be relatively 
small (this is what is called the “bad part” in the formal analysis that follows). 
Naturally, there will be $2k$ such “bad” sub-intervals, so the larger $k$ is, the
larger the sub-sample needed. Except for these “bad” sub-intervals, all other
sub-intervals on which the algorithm errs have to have at least half of their
points modified by the adversary. Thus the total error will be roughly $2\eta$, plus
the weight of the “bad” sub-intervals. 
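
The two informal steps for intervals on the line can be sketched as follows.
This is a minimal illustration with hypothetical names; it breaks ties in favor
of the label 0 and treats each cell as a half-open interval, choices that the
informal description leaves unspecified.

```python
import random
from bisect import bisect_right

def learn_intervals(sample, sub_sample_size, rng):
    """Sub-sample to cut the line into cells, then majority-vote per cell."""
    # Step 1: the breakpoints are the points of a small random sub-sample.
    breakpoints = sorted({x for x, _ in rng.sample(sample, sub_sample_size)})
    # Step 2: tally the labels of the *complete* sample inside each cell.
    votes = [0] * (len(breakpoints) + 1)
    for x, label in sample:
        votes[bisect_right(breakpoints, x)] += 1 if label == 1 else -1

    def hypothesis(x):
        return 1 if votes[bisect_right(breakpoints, x)] > 0 else 0
    return hypothesis

# Toy usage: target interval [3, 7) over the integers 0..9, three copies of
# every point, one copy of the point 5 corrupted by the adversary.
sample = [(x, int(3 <= x < 7)) for x in range(10)] * 3
sample[5] = (5, 0)
hypothesis = learn_intervals(sample, sub_sample_size=30, rng=random.Random(0))
```

In the toy run the corrupted copy of the point 5 is outvoted by the two clean
copies in its cell, so the hypothesis agrees with the target on all ten points.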



[Figure: four aligned rows along the line: the target concept; the (erroneous)
sample points, marked + and -; the sub-sample and the sub-intervals it induces,
four of which are marked "Bad"; and the algorithm’s hypothesis.]

Fig. 1. Example of NastyLearn for intervals.



Now, we proceed to a formal description of the learning algorithm. Given
the constant $d$, the size $k$ of the target function, the bound on the error rate $\eta$,
the parameters $\delta$ and $\Delta$, and two additional parameters $M$, $N$ (to be specified
below), the algorithm proceeds as follows:

Algorithm NastyLearn: 

1. Request a sample S of size N. 

2. Choose uniformly at random a sub-sample $R \subseteq S$ of size $M$.

3. Use the consistency algorithm for $(C, \mathcal{H})$ to compute

$S_{\mathrm{CON}}(R) = \{((P_1, R \setminus P_1), h_1), \ldots, ((P_t, R \setminus P_t), h_t)\}$.

4. Output the hypothesis $H(h_1, \ldots, h_t)$, computed as follows: For any $W_a \in
\mathcal{W}(h_1, \ldots, h_t)$ that is not empty, set $H$ to be the majority of labels in $S \cap W_a$.
If $W_a$ is empty, set $H$ to be 0 on any $x \in W_a$.
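
Step 4 can be sketched in code once the hypotheses $h_1, \ldots, h_t$ are
available. The sketch below assumes they are given as Python callables (how
step 3 produces them is not modeled here), uses hypothetical names, and
resolves ties and cells without sample points to the label 0.

```python
from collections import Counter

def cell_majority_hypothesis(sample, hypotheses):
    """One majority vote per cell W_a, where a = (h_1(x), ..., h_t(x))."""
    def signature(x):
        # The vector a identifying the cell that contains x.
        return tuple(h(x) for h in hypotheses)

    tallies = {}
    for x, label in sample:
        tallies.setdefault(signature(x), Counter())[label] += 1

    def hypothesis(x):
        tally = tallies.get(signature(x))
        if tally is None:          # no sample point fell in this cell
            return 0
        return 1 if tally[1] > tally[0] else 0
    return hypothesis

# Toy usage: two threshold hypotheses cut 0..9 into three cells.
hs = [lambda x: int(x >= 3), lambda x: int(x >= 7)]
sample = [(x, int(3 <= x < 7)) for x in range(10)]
sample[4] = (4, 0)                 # one label corrupted by the adversary
H = cell_majority_hypothesis(sample, hs)
```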



® The algorithm can use the “doubling” technique in case that k is not given to it. In 
this case however, the sample size is not known in advance and so we need to use 
the extended definition of the Nasty Noise model that allows repetitive queries to 
the oracle. 
Theorem 5. Let

$$M = \max\left( \frac{24k}{\Delta} \log \frac{8}{\delta},\ \frac{c_2 dk}{\Delta} \log \frac{78k}{\Delta} \right)$$

and N 



W 









2‘‘d + log ■ 



where $c_1$ and $c_2$ are constants. Then, Algorithm NastyLearn learns the class $C^*$
with accuracy $\epsilon = 2\eta + \Delta$ and confidence $\delta$ in time polynomial in $k$, $1/\delta$ and $1/\Delta$.
Before commencing with the actual proof, we present a technical lemma: 

Lemma 2. Assuming $N$ is set as in the statement of Theorem 5, with probability
at least $1 - \delta/4$, $E$ (the number of points in which errors are introduced) is at most
$(\eta + \Delta/12)N$.

For lack of space, the proof is omitted (see 0 for details). We are now ready to
present the proof of Theorem 5.

Proof. To analyze the error made by the hypothesis that the algorithm generates, 
let us denote the adversary’s strategy as follows: 

1. Generate a sample of the requested size N according to the distribution D, 
and label it by the target concept F. Denote this sample by Sg. 

2. Choose a subset $S_{out} \subseteq S_g$ of size $E = E(S_g)$, where $E(S_g)$ is a random
variable (as defined in Sect. 2.2).

3. Choose (maliciously) some other set of points $S_{in} \subseteq X \times \{0,1\}$ of size $E$.

4. Hand to the learning algorithm the sample $S = (S_g \setminus S_{out}) \cup S_{in}$.

Assume the target function $F$ is of the form $F = f(g_1, \ldots, g_k)$. For all $i \in
\{1, \ldots, k\}$, denote by $h_{j_i}$, where $j_i \in \{1, \ldots, t\}$, the hypothesis the algorithm
has chosen in step 3 that exhibits the same behavior $g_i$ has over the points of $R$
(from the definition of $S_{\mathrm{CON}}$ we are guaranteed that such a hypothesis exists).
By definition, there are no points from $R$ in $h_{j_i} \triangle g_i$, so:

$$R \cap (h_{j_i} \triangle g_i) = \emptyset. \qquad (1)$$

As the VC-dimension of both the class $C$ of all $g_i$’s and the class $\mathcal{H}$ of all $h_i$’s
is $d$, the class of all their possible symmetric differences also has VC-dimension
$O(d)$ (see Sect. 2.3). By applying Theorem 1 when viewing $R$ as a sample taken
from $S$ according to the uniform distribution, and by choosing $M$ to be as in
the statement of the theorem, $R$ will be an $\alpha$-net (with respect to the uniform
distribution over $S$) for the class of symmetric differences, with $\alpha = \Delta/6k$, with
probability at least $1 - \delta/4$. Note that there may still be points in $S$ which are
in $h_{j_i} \triangle g_i$. Hence, we consider the sets $S \cap (h_{j_i} \triangle g_i)$. Now, by using (1) we get that

$$\frac{|S \cap (h_{j_i} \triangle g_i)|}{|S|} < \frac{\Delta}{6k}$$

with probability at least $1 - \delta/4$, simultaneously for all $i$.

For every sub-domain $B \in \mathcal{W}(h_1, \ldots, h_t)$ we define:

$$N_B \triangleq |S_g \cap B|, \qquad N_B^{in} \triangleq |S_{in} \cap B|,$$

$$N_B^{out,g} \triangleq \Big| S_{out} \cap B \setminus \bigcup_i (h_{j_i} \triangle g_i) \Big|, \qquad
N_B^{out,b} \triangleq \Big| S_{out} \cap B \cap \bigcup_i (h_{j_i} \triangle g_i) \Big|,$$

$$N_B^{\alpha} \triangleq \Big| S \cap B \cap \bigcup_i (h_{j_i} \triangle g_i) \Big|.$$
In words, $N_B$ and $N_B^{in}$ simply stand for the size of the restriction of the original
(noise-free) sample $S_g$ and the noisy examples $S_{in}$ introduced by the adversary
to the sub-domain $B$. The other definitions are based on the distinction between
the “good” part of $B$, where the $g_i$’s and the $h_{j_i}$’s behave the same, and the “bad”
part, which is present due to the fact that the $g_i$’s and the $h_{j_i}$’s exhibit the same
behavior only on the sub-sample $R$, rather than on the complete sample $S$.

Since the hypothesis takes a majority vote in each sub-domain, it will
err on the domain $B \setminus \bigcup_i (h_{j_i} \triangle g_i)$ if the number of examples left untouched in
$B$ is less than the number of examples in $B$ that were modified by the adversary,
plus those that were misclassified by the $h_{j_i}$’s (with respect to the $g_i$’s). This
may be formulated as the following condition: $N_B^{in} + N_B^{\alpha} \ge N_B - N_B^{out,g} - N_B^{\alpha}$.

Therefore, the total error the algorithm may experience is at most:

$$D\Big( \bigcup_i (h_{j_i} \triangle g_i) \Big) + \sum_{B :\ N_B \le N_B^{in} + N_B^{out,g} + 2N_B^{\alpha}} D(B).$$



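We now calculate a bound for each of the two terms above separately. To bound
the second term, note that by Theorem 2 our choice of $N$ guarantees $S_g$ to be a
$\frac{\Delta}{6|\mathcal{W}(h_1, \ldots, h_t)|}$-sample for our domain with probability at least $1 - \delta/4$. Note that
from the definition of $\mathcal{W}(h_1, \ldots, h_t)$ and from the Sauer Lemma |21)] we have
that $|\mathcal{W}(h_1, \ldots, h_t)| \le t^{\mathrm{VCdim}(\mathcal{H}^\perp)}$, which, with Claim 1 and Lemma 1, bounds
$|\mathcal{W}(h_1, \ldots, h_t)|$ in terms of $M$ and $d$.
Our choice of $N$ indeed guarantees, with probability at least $1 - \delta/4$, that:

$$\sum_{B :\ N_B \le N_B^{in} + N_B^{out,g} + 2N_B^{\alpha}} D(B)
\le \sum_{B :\ N_B \le N_B^{in} + N_B^{out,g} + 2N_B^{\alpha}} \left( \frac{N_B}{N} + \frac{\Delta}{6|\mathcal{W}(h_1, \ldots, h_t)|} \right)
\le \frac{\Delta}{6} + \sum_{B \in \mathcal{W}(h_1, \ldots, h_t)} \frac{N_B^{in} + N_B^{out,g} + 2N_B^{\alpha}}{N}.$$

From the above choice of $N$, it follows that $S_g$ is also a $\frac{\Delta}{6k}$-sample for
the class of symmetric differences of the form $h_{j_i} \triangle g_i$. Thus, with probability at
least $1 - \delta/4$, we have:

$$D\Big( \bigcup_i (h_{j_i} \triangle g_i) \Big) \le \frac{\Delta}{3} + \sum_{B \in \mathcal{W}(h_1, \ldots, h_t)} \frac{N_B^{out,b}}{N}.$$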


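The total error made by the hypothesis (assuming that none of the four bad
events happens) is therefore bounded by:

$$\Pr[H \triangle F] \le \frac{3\Delta}{6} + \sum_{B \in \mathcal{W}(h_1, \ldots, h_t)} \frac{N_B^{in} + N_B^{out,g} + N_B^{out,b} + 2N_B^{\alpha}}{N} \le 2\eta + \Delta = \epsilon,$$

as required. This bound holds with certainty at least $1 - \delta$. □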
Abstract. In many learning problems, labeled examples are rare or ex-
pensive while numerous unlabeled and positive examples are available.
However, most learning algorithms only use labeled examples. Thus we 
address the problem of learning with the help of positive and unlabeled 
data given a small number of labeled examples. We present both the- 
oretical and empirical arguments showing that learning algorithms can 
be improved by the use of both unlabeled and positive data. As an illus- 
trating problem, we consider the learning algorithm from statistics for 
monotone conjunctions in the presence of classification noise and give 
empirical evidence of our assumptions. We give theoretical results for 
the improvement of Statistical Query learning algorithms from positive 
and unlabeled data. Lastly, we apply these ideas to tree induction algo- 
rithms. We modify the code of C4.5 to get an algorithm which takes as 
input a set LAB of labeled examples, a set POS of positive examples and 
a set UNL of unlabeled data and which uses these three sets to construct 
the decision tree. We provide experimental results based on data taken 
from UCI repository which confirm the relevance of this approach. 



Key words: PAC model, Statistical Queries, Unlabeled Examples, Positive Examples,
Decision Trees, Data Mining 



1 Introduction 

Usual learning algorithms only use labeled examples. But, in many machine 
learning settings, gathering large sets of unlabeled examples is easy. This remark 
has been made about text classification tasks, and learning algorithms able to
classify text from labeled and unlabeled documents have recently been proposed
([BM98], [NMTM98]). We also argue that, for many machine learning problems,
a “natural” source of positive examples (that belong to a single class) is available 
and positive data are abundant and cheap. For example consider a classical 
domain, such as the diagnosis of diseases: unlabeled data are abundant (all 
patients); positive data may be numerous (all the patients who have the disease); 
but, labeled data are rare if detection tests for this disease are expensive. As a 

* This research was partially supported by “Motricite et Cognition” : Contrat par 
objectifs region Nord/Pas-de-Calais 
(c) Springer-Verlag Berlin Heidelberg 1999




Francesco De Comite et al. 



second example, consider mailing for a specific marketing action: unlabeled data 
are all the clients in the database; positive data are all the clients who asked 
information about the product concerned by the marketing action before the 
mailing was done; but, labeled data are rare and expensive because a survey 
has to be done on a part of the database of all clients. We do not address 
text classification problems in the present paper, but they are concerned too: 
for a web-page classification problem, unlabeled web-pages can be inexpensively 
gathered, a set of web pages you are interested in is available in your bookmarks, 
labeled web-pages are fairly expensive but a small set of hand labeled web-pages 
can be designed. 

It has been proved in |D^ that many concept classes, namely those
which are learnable from statistical queries, can be efficiently learned in a PAC 
framework using positive and unlabeled data only. But the price to pay is an 
increase in the number of examples needed to achieve learning (although it re- 
mains of polynomial size). We consider the problem of learning with a small 
set of labeled examples, a set of positive examples and a large set of unlabeled 
examples. We assume that unlabeled examples are drawn according to some hid- 
den distribution D, that labeled examples are drawn according to the standard 
example oracle EX(f,D), and that positive examples are drawn according to 
the oracle $EX(f, D_f)$ where $D_f$ is the distribution $D$ restricted to positive ex-
amples. The reader should note that our problem is different from the problem 
of learning with imbalanced training sets (see [KM97]) because we use three
sources of examples. In the method we discuss here, labeled examples are only 
used to estimate the target weight (the proportion of positive examples among 
all examples); therefore, if an estimate of the target weight is available for the 
problem, only positive and unlabeled data are needed. We present experimental 
results showing that unlabeled data and positive data can efficiently boost ac- 
curacy of the statistical query learning algorithm for monotone conjunctions in 
the presence of classification noise. Such boosting can be explained by the fact 
that SQ algorithms are based on the estimate of probabilities. We prove that 
these estimates could be replaced by: an estimate of the weight of the target 
concept with respect to (w.r.t.) the hidden distribution using the (small) set of 
labeled examples and estimates of probabilities which can be computed from 
positive and unlabeled data only. If the sets of unlabeled and positive data are 
large enough, all estimates can be calculated within the accuracy of the esti- 
mate of the weight of the target concept. We present theoretical arguments in 
the PAC framework showing that a gain in the size of the query space (or its VC 
dimension) can be obtained on the number of labeled examples. But as usual, 
the results could be better for real problems. 

In the last section of the paper, we consider standard methods of decision tree 
induction and examine the commonly used C4.5 algorithm described in |Qui 
In this algorithm, when refining a leaf into an internal node, the decision criterion 
is based on statistical values. Therefore, C4.5 can be seen as a statistical query 
algorithm and the above ideas can be applied. We adapt the code of C4.5. Our 
algorithm takes as inputs three sets: a set of labeled examples, a set of positive 
examples and a set of unlabeled examples. The information gain criterion used
by C4.5 is modified such that the three sets are used. The reader should note 
that labeled examples are used only once for the computation of the weight of 
the target concept under the hidden distribution. We provide some promising 
experimental results, but further experiments are needed for an experimental 
validation of our approach. 

2 Preliminaries 

2.1 Basic Definitions and Notations 

For each $n \geq 1$, $X_n$ denotes an instance space on $n$ attributes. A concept $f$ is a 
subset of some instance space $X_n$ or equivalently a $\{0,1\}$-valued function defined 
on $X_n$. For each $n \geq 1$, let $C_n \subseteq 2^{X_n}$ be a set of concepts. Then $C = \bigcup C_n$ denotes 
a concept class over $X = \bigcup X_n$. The size of a concept $f$ is the size of a smallest 
representation for a given representation scheme. An example of a concept $f$ is 
a pair $(x, f(x))$, which is positive if $f(x) = 1$ and negative otherwise. We denote 
by $Pos(f)$ the set of all $x$ such that $f(x) = 1$. If $D$ is a distribution defined 
over $X$ and if $A$ is a subset of the instance space $X$, we denote by $D(A)$ the 
probability of the event $[x \in A]$ and we denote by $D_A$ the induced distribution. 
For instance, if $f$ is a concept over $X$ such that $D(f) \neq 0$, $D_f(x) = D(x)/D(f)$ 
if $x \in Pos(f)$ and 0 otherwise. We denote by $\bar{f}$ the complement of the set $f$ in 
$X$ and by $f \Delta g$ the symmetric difference between $f$ and $g$. A monotone conjunction 
is a conjunction of boolean variables. For each $x \in \{0,1\}^n$, we use the notation 
$x(i)$ to indicate the $i$th bit of $x$. If $V$ is a subset of $\{x_1, \ldots, x_n\}$, the conjunction 
of variables in $V$ is denoted by $\prod_{x_i \in V} x_i$. If $V = \emptyset$, $\prod_{x_i \in V} x_i = 1$. 



2.2 PAC and SQ Models 

Let $f$ be a target concept in some concept class $C$. Let $D$ be the hidden 
distribution defined over $X$. In the PAC model [Val84], the learner is given access to an 
example oracle $EX(f, D)$ which returns at each call an example $(x, f(x))$ drawn 
randomly according to $D$. A concept class $C$ is PAC learnable if there exist a 
learning algorithm $L$ and a polynomial $p(\cdot,\cdot,\cdot,\cdot)$ with the following property: for 
any $f \in C$, for any distribution $D$ on $X$, and for any $0 < \epsilon < 1$ and $0 < \delta < 1$, 
if $L$ is given access to $EX(f, D)$ and to inputs $\epsilon$ and $\delta$, then with probability at 
least $1 - \delta$, $L$ outputs a hypothesis concept $h$ satisfying $error(h) = D(f \Delta h) \leq \epsilon$ 
in time bounded by $p(1/\epsilon, 1/\delta, n, size(f))$. 

The SQ model [Kea93] is a specialization of the PAC model in which the 
learner forms its hypothesis solely on the basis of estimates of probabilities. 
A statistical query over $X_n$ is a mapping $\chi : X_n \times \{0,1\} \to \{0,1\}$ associated 
with a tolerance $0 < \tau < 1$. In the SQ model the learner is given access to a 
statistics oracle $STAT(f, D)$ which, at each query $(\chi, \tau)$, returns an estimate of 
$D(\{x \mid \chi((x, f(x))) = 1\})$ within accuracy $\tau$. Let $C$ be a concept class over $X$. We 
say that $C$ is SQ-learnable if there exist a learning algorithm $L$ and polynomials 
$p(\cdot,\cdot,\cdot)$, $q(\cdot,\cdot,\cdot)$ and $r(\cdot,\cdot,\cdot)$ with the following property: for any $f \in C$, for any 
distribution $D$ over $X$, and for any $0 < \epsilon < 1$, if $L$ is given access to $STAT(f, D)$ 
and to an input $\epsilon$, then, for every query $(\chi, \tau)$ made by $L$, the predicate $\chi$ can 
be evaluated in time $q(1/\epsilon, n, size(f))$, $1/\tau$ is bounded by $r(1/\epsilon, n, size(f))$, 
$L$ halts in time bounded by $p(1/\epsilon, n, size(f))$ and $L$ outputs a hypothesis $h \in C$ 
satisfying $D(f \Delta h) \leq \epsilon$. 

It is clear that given access to the example oracle $EX(f, D)$, it is easy to 
simulate the statistics oracle $STAT(f, D)$ by drawing a sufficiently large set of 
labeled examples. This is formalized by the following result: 

Theorem 1. [Kea93] Let $C$ be a class of concepts over $X$. Suppose that $C$ is SQ 
learnable by algorithm $L$. Then $C$ is PAC learnable, and furthermore: 

— If $L$ uses a finite query space $Q$ and $\alpha$ is a lower bound on the allowed 
approximation error for every query made by $L$, the number of calls of $EX(f, D)$ 
is $O\left(\frac{1}{\alpha^2} \log \frac{|Q|}{\delta}\right)$ 

— If $L$ uses a query space $Q$ of finite VC dimension $d$ and $\alpha$ is a lower bound 
on the allowed approximation error for every query made by $L$, the number 
of calls of $EX(f, D)$ is $O\left(\frac{d}{\alpha^2} \log \frac{1}{\delta}\right)$ 
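
The simulation of the statistics oracle by sampling can be sketched as follows. This is only an illustration under stated assumptions: the uniform distribution over $\{0,1\}^4$, the target $x_1 \wedge x_2$, and the helper names `simulate_stat` and `make_oracle` are hypothetical and do not come from the paper.

```python
import random

def simulate_stat(chi, example_oracle, num_samples):
    """Estimate D({x | chi(x, f(x)) = 1}) by averaging chi over labeled draws."""
    hits = 0
    for _ in range(num_samples):
        x, label = example_oracle()
        hits += chi(x, label)
    return hits / num_samples

def make_oracle(f, n, rng):
    """Hypothetical EX(f, D) where D is uniform over {0,1}^n (an assumption)."""
    def oracle():
        x = tuple(rng.randint(0, 1) for _ in range(n))
        return x, f(x)
    return oracle

rng = random.Random(0)
target = lambda x: int(x[0] == 1 and x[1] == 1)        # x_1 AND x_2
oracle = make_oracle(target, 4, rng)
# Query: "the example is positive and its third bit is 0"; exact value is 1/8.
chi = lambda x, label: int(label == 1 and x[2] == 0)
estimate = simulate_stat(chi, oracle, 20000)
```

The number of draws plays the role of the sample size in Theorem 1: a smaller tolerance requires more calls to the example oracle.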

The reader should note that this result has been extended to white noise 
PAC models: the Classification Noise model of Angluin and Laird [AL88] and the 
Constant Partition Classification Noise model [Dec97]. The proofs may be found 
in [Kea93] and [Dec97]. Also note that almost all the concept classes known 
to be PAC learnable are SQ learnable and are therefore PAC learnable with 
classification noise. 

3 Learning Monotone Conjunctions in the Presence of 
Classification Noise 

In this section, the target concept is a monotone conjunction over $\{x_1, \ldots, x_n\}$. 
In the noise free case, a learning algorithm for monotone conjunctions is: 



Learning Monotone Conjunctions - Noise Free Case 
input: $\epsilon$, $\delta$ 

$V = \emptyset$ 

Draw a sample $S$ of $m(\epsilon, \delta)$ examples 
for i=1 to n do 

if for every positive example $(x, 1)$, $x(i) = 1$ then $V \leftarrow V \cup \{x_i\}$ 
output: $h = \prod_{x_i \in V} x_i$ 
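
A minimal Python sketch of the noise-free algorithm above, assuming examples are given as pairs of a bit tuple and a label. The function names are hypothetical, and variables are identified by their 0-based index rather than by $x_i$.

```python
def learn_monotone_conjunction(sample, n):
    """Return the index set V of variables equal to 1 in every positive example.

    sample: list of (x, label) pairs, x being a tuple of n bits.
    """
    kept = set(range(n))
    for x, label in sample:
        if label == 1:
            kept = {i for i in kept if x[i] == 1}
    return kept

def conjunction(kept):
    """Build the hypothesis h as the conjunction of the kept variables."""
    return lambda x: int(all(x[i] == 1 for i in kept))

sample = [((1, 1, 0), 1), ((1, 1, 1), 1), ((0, 1, 1), 0)]
kept = learn_monotone_conjunction(sample, 3)
h = conjunction(kept)
```

Starting from all variables and discarding those that are 0 in some positive example yields the same set $V$ as adding the variables that are 1 in every positive example.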



It can be proved that $O\left(\frac{1}{\epsilon} \log \frac{1}{\delta} + \frac{n}{\epsilon}\right)$ examples are enough to guarantee 
that the hypothesis $h$ output by the learning algorithm has error less than $\epsilon$ 
with confidence at least $1 - \delta$. The given algorithm is not noise-tolerant. In the 
presence of classification noise, it is necessary to compute an estimate $\hat{p}_1(x_i = 0)$ 
of $p_1(x_i = 0)$, which is the probability that a random example drawn according to the 
hidden distribution $D$ is positive and satisfies $x(i) = 0$. Then only variables for 
which this estimate is small enough are included in the output hypothesis. Let us 
suppose that examples are drawn according to a noisy oracle which, on each call, 
first draws an instance $x$ according to $D$ together with its correct label and then 
flips the label with probability $0 \leq \eta < 1/2$. If the noise rate $\eta$ is known, 
we can consider the following learning algorithm for monotone 
conjunctions from statistics in the presence of classification noise: 



Learning Monotone Conjunctions - Noise Tolerant Case 
input: $\epsilon$, $\delta$, $\eta$ 

$V = \emptyset$ 

Draw a sample $S$ of $m(\epsilon, \delta, \eta)$ examples 

the size of $S$ is sufficient to ensure that the following estimates 
are accurate to within $\epsilon/(4n)$ with a confidence greater than $1 - \delta$ 
for i=1 to n do 

compute an estimate $\hat{p}_1(x_i = 0)$ of $p_1(x_i = 0)$ 
if $\hat{p}_1(x_i = 0) \leq \epsilon/(2n)$ then $V \leftarrow V \cup \{x_i\}$ 
output: $h = \prod_{x_i \in V} x_i$ 



If the noise rate is not known, we can estimate it with techniques described in 
[AL88]. We do not consider that case because we want to show the best expected 
gain. 

Let $q_1(x_i = 0)$ (resp. $q_0(x_i = 0)$) be the probability that a random example 
according to the noisy oracle is positive (resp. negative) and satisfies $x(i) = 0$. 
We have: 

$$p_1(x_i = 0) = \frac{(1 - \eta)\, q_1(x_i = 0) - \eta\, q_0(x_i = 0)}{1 - 2\eta} \qquad (1)$$

Thus we can estimate $p_1(x_i = 0)$ using estimates of $q_1(x_i = 0)$ and $q_0(x_i = 0)$. 
Then simple algebra and standard Chernoff bounds may be used to prove that 
$O\left(\frac{n^2 \log n}{\epsilon^2 (1 - 2\eta)^2} \log \frac{1}{\delta}\right)$ examples are sufficient to guarantee that the 
hypothesis $h$ output by the learning algorithm has error less than $\epsilon$ with 
confidence at least $1 - \delta$. The reader should note that this bound is considerably larger than 
the noise-free one. We now make the assumption that labeled examples are 
rare, but that sources of unlabeled examples and positive examples are available 
to the learner. Unlabeled examples are drawn according to $D$. A noisy positive 
oracle, on each call, draws examples from the noisy oracle until it gets one with 
label 1. 
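
Equation (1) can be checked numerically with a short sketch. The forward noise model below, in which `p0` stands for the probability that an example is truly negative and satisfies $x(i) = 0$, is an assumption introduced for the check; the name `denoise_p1` is hypothetical.

```python
def denoise_p1(q1, q0, eta):
    """Recover p1(x_i = 0) from the noisy-oracle probabilities, as in (1)."""
    if not 0 <= eta < 0.5:
        raise ValueError("noise rate must satisfy 0 <= eta < 1/2")
    return ((1 - eta) * q1 - eta * q0) / (1 - 2 * eta)

# Assumed forward model: each label is flipped independently with probability eta.
p1, p0, eta = 0.03, 0.10, 0.2
q1 = (1 - eta) * p1 + eta * p0      # observed positive and x(i) = 0
q0 = eta * p1 + (1 - eta) * p0      # observed negative and x(i) = 0
recovered = denoise_p1(q1, q0, eta)
```

The division by $1 - 2\eta$ is what inflates estimation errors as the noise rate approaches 1/2, which matches the $(1 - 2\eta)^2$ factor in the sample bound.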

We raise the following problems: 

— How can we use positive and unlabeled examples in the previous learning 

algorithm? 

— What could be the expected gain? 

In the learning algorithm of conjunctions from statistics with noise, an 
estimate of $p_1(x_i = 0)$ is calculated using (1). From the usual formulas for 
conditional probabilities, $q_1(x_i = 0)$ may be expressed as the probability $q(1)$ that 
a labeled example is positive according to the noisy oracle times the 
probability $q_f(x_i = 0)$ that a positive example (drawn according to the positive 
noisy oracle) satisfies $x(i) = 0$. Now, using the formula for probabilities of 
disjoint events, $q_0(x_i = 0)$ is equal to the probability $q(x_i = 0)$ that an 
unlabeled example (drawn according to $D$) satisfies $x(i) = 0$ minus $q_1(x_i = 0)$. 
Thus, to compute an estimate of $p_1(x_i = 0)$, we use the following equations: 
$q_1(x_i = 0) = q(1) \times q_f(x_i = 0)$, $q_0(x_i = 0) = q(x_i = 0) - q_1(x_i = 0)$ and 
$p_1(x_i = 0) = [(1 - \eta) q_1(x_i = 0) - \eta q_0(x_i = 0)]/(1 - 2\eta)$. Consequently, to 
compute estimates of $p_1(x_i = 0)$ for all $i$, we have to compute an estimate of $q(1)$ 
with labeled examples, estimates of $q_f(x_i = 0)$ for every $i$ using the source of 
positive examples, and estimates of $q(x_i = 0)$ for every $i$ using the 
source of unlabeled examples. The reader should note that labeled examples are 
used only once for the calculation of an estimate of the probability that a labeled 
example is positive. Thus, we have given a positive answer to our first question: 
unlabeled examples and positive examples can be used in the learning algorithm 
of conjunctions from statistics. We now raise the second question, that is: what 
could be the expected gain? We give below an experimental answer to this 
question. 

We compare three algorithms: 

— The first one is the learning algorithm of conjunctions from statistics where 
only labeled examples are used 

— The second one computes an estimate of $q(1)$ from labeled examples and 
uses exact values for $q_f(x_i = 0)$ and $q(x_i = 0)$. This amounts to saying that 
an infinite pool of positive and unlabeled data is available 

— The third one computes an estimate of $q(1)$ from labeled examples and 
estimates of $q_f(x_i = 0)$ and $q(x_i = 0)$ from a finite number of positive and 
unlabeled examples 

Each of these three algorithms outputs an ordered list $V = (x_{\sigma(1)}, \ldots, x_{\sigma(n)})$ 
of variables such that, for each $i$, $\hat{p}_1(x_{\sigma(i)} = 0) \leq \hat{p}_1(x_{\sigma(i+1)} = 0)$. For a given 
ordered list $V$, and for each $i$, we define $g_i(V) = \prod_{j \leq i} x_{\sigma(j)}$. The minimal error 
of an ordered list $V$ is defined as $error_{min}(V) = \min\{error(g_i(V)) \mid 0 \leq i \leq n\}$, 
which is the least error rate we can hope for. We compare the minimal errors for 
the three algorithms. First, let us make these three algorithms more precise. 
We recall that labeled examples are drawn from a noisy oracle, that positive 
examples are drawn from the noisy oracle restricted to positive examples, and 
that the noise rate $\eta$ is known. 



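Algorithm $L(LAB_N)$ 

input: a sample LAB of $N$ labeled examples 
for i=1 to n do 

$\hat{q}_1(x_i = 0) = |\{(x, c) \in LAB \mid x(i) = 0 \wedge c = 1\}| / N$ 
$\hat{q}_0(x_i = 0) = |\{(x, c) \in LAB \mid x(i) = 0 \wedge c = 0\}| / N$ 
$\hat{p}_1(x_i = 0) = [(1 - \eta)\hat{q}_1(x_i = 0) - \eta \hat{q}_0(x_i = 0)] / (1 - 2\eta)$ 

output: ordered list $V = (x_{\sigma(1)}, \ldots, x_{\sigma(n)})$ 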



Algorithm $L(LAB_N, POS_\infty, UNL_\infty)$ 
input: a sample LAB of $N$ labeled examples 
$\hat{q}(1) = |\{(x, c) \in LAB \mid c = 1\}| / N$ 
for i=1 to n do 

compute exactly $q_f(x_i = 0)$ 
compute exactly $q(x_i = 0)$ 
$\hat{q}_1(x_i = 0) = \hat{q}(1) \times q_f(x_i = 0)$ 
$\hat{q}_0(x_i = 0) = q(x_i = 0) - \hat{q}_1(x_i = 0)$ 
$\hat{p}_1(x_i = 0) = [(1 - \eta)\hat{q}_1(x_i = 0) - \eta \hat{q}_0(x_i = 0)] / (1 - 2\eta)$ 

output: ordered list $V = (x_{\sigma(1)}, \ldots, x_{\sigma(n)})$ 



Algorithm $L(LAB_N, POS_M, UNL_M)$ 
input: a sample LAB of $N$ labeled examples, 
a sample POS of $M$ positive examples 
and a sample UNL of $M$ unlabeled examples 
$\hat{q}(1) = |\{(x, c) \in LAB \mid c = 1\}| / N$ 
for i=1 to n do 

$\hat{q}_f(x_i = 0) = |\{(x, c) \in POS \mid x(i) = 0\}| / M$ 
$\hat{q}(x_i = 0) = |\{x \in UNL \mid x(i) = 0\}| / M$ 
$\hat{q}_1(x_i = 0) = \hat{q}(1) \times \hat{q}_f(x_i = 0)$ 
$\hat{q}_0(x_i = 0) = \hat{q}(x_i = 0) - \hat{q}_1(x_i = 0)$ 
$\hat{p}_1(x_i = 0) = [(1 - \eta)\hat{q}_1(x_i = 0) - \eta \hat{q}_0(x_i = 0)] / (1 - 2\eta)$ 

output: ordered list $V = (x_{\sigma(1)}, \ldots, x_{\sigma(n)})$ 
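
The third algorithm can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' scripts: the function name `rank_variables`, the data layout (bit tuples, 0-based variable indices) and the toy samples are assumptions.

```python
def rank_variables(lab, pos, unl, n, eta):
    """Order variables by the estimate of p1(x_i = 0), smallest first.

    lab: list of (x, c) labeled pairs, used only to estimate q(1);
    pos: list of x drawn from the noisy positive oracle;
    unl: list of unlabeled x.
    """
    q_one = sum(1 for _, c in lab if c == 1) / len(lab)
    estimates = []
    for i in range(n):
        qf = sum(1 for x in pos if x[i] == 0) / len(pos)
        q = sum(1 for x in unl if x[i] == 0) / len(unl)
        q1 = q_one * qf                      # positive and x(i) = 0
        q0 = q - q1                          # negative and x(i) = 0
        estimates.append(((1 - eta) * q1 - eta * q0) / (1 - 2 * eta))
    order = sorted(range(n), key=lambda i: estimates[i])
    return order, estimates

lab = [((1, 1), 1), ((0, 1), 0)]
pos = [(1, 1), (1, 0)]
unl = [(1, 1), (0, 1), (1, 0), (0, 0)]
order, estimates = rank_variables(lab, pos, unl, n=2, eta=0.2)
```

With finite samples the estimates may fall slightly below 0; since only their ordering is used to build the list $V$, the sketch leaves them unclipped.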



Now, we describe experiments and experimental results¹. The concept class 
is the class of monotone conjunctions over $n$ variables $x_1, \ldots, x_n$ for some $n$. The 
target concept is a conjunction containing five variables. The class $\mathcal{D}$ of 
distributions is defined as follows: $D \in \mathcal{D}$ is characterized by a tuple $(p_1, \ldots, p_n) \in [0, \rho]^n$ 
where $\rho = 2(1 - 2^{-1/5}) \approx 0.26$; for a given $D \in \mathcal{D}$, all values $x(i)$ are selected 
independently of each other, and $x(i)$ is set to 0 with probability $p_i$ and set to 
1 with probability $1 - p_i$. Note that $\rho$ has been chosen such that, if each $p_i$ is 
drawn randomly and independently in $[0, \rho]$, the average weight of the target 
concept $f$ w.r.t. $D$ is 0.5. We suppose that examples are drawn according to 
noisy oracles where the noise rate $\eta$ is set to 0.2. 
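
The choice of the upper bound $2(1 - 2^{-1/5})$ can be verified with a two-line computation, assuming (as stated above) that each $p_i$ is drawn uniformly and independently, so that the expected weight of a five-variable conjunction is the fifth power of $E[1 - p_i]$. The variable names are illustrative.

```python
rho = 2 * (1 - 2 ** (-1 / 5))

# For p_i uniform on [0, rho], E[1 - p_i] = 1 - rho / 2 = 2 ** (-1 / 5).
# The five variables are independent, so the expected weight is a product.
expected_weight = (1 - rho / 2) ** 5
```

Here `rho` is about 0.259 and `expected_weight` equals 0.5 up to floating-point rounding.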

Experiment 1. The number $n$ of variables is set to 100. We compare the 
averages of minimal errors for algorithms $L(LAB_N)$ and $L(LAB_N, POS_\infty, UNL_\infty)$ 
as functions of the number $N$ of labeled examples. For a given $N$, 
averages of minimal errors for an algorithm are obtained by doing $k$ times the 
following: 

— $f$ is a randomly chosen conjunction of five variables 
— $D$ is chosen randomly in $\mathcal{D}$ by choosing randomly and independently 
each $p_i$ 

— $N$ examples are drawn randomly w.r.t. $D$, they are labeled according to 
the target concept $f$, then the correct label is flipped with probability $\eta$ 
— Minimal errors for $L(LAB_N)$ and $L(LAB_N, POS_\infty, UNL_\infty)$ are 
computed 

and then computing the averages over the $k$ iterations. We set $k$ to 100. The 
results can be seen in Fig. 1. The top plot corresponds to $L(LAB_N)$ and the 
bottom plot to $L(LAB_N, POS_\infty, UNL_\infty)$. 
Experiment 2. We now consider a more realistic case: there are $M$ positive 
examples and an equal number of unlabeled examples. We show that 
together with a small number $N$ of labeled examples, these positive and 
unlabeled examples give about as much information as do $M$ labeled 
examples alone. We compare the averages of minimal errors for algorithms 
$L(LAB_N, POS_M, UNL_M)$ and $L(LAB_M)$ as functions of $M$. The number 
$n$ of variables is set to 100. The number $N$ of labeled examples is set to 10. 
The results can be seen in Fig. 1. 

¹ Sources and scripts can be found at 
ftp://grappa.univ-lille3.fr/pub/Softs/posunlab. 

Fig. 1. Results of experiments 1 and 2. For experiment 1: target size = 5; 100 
variables; 100 iterations. This figure shows the gain we can expect using free 
positive and unlabeled data. For experiment 2: target size = 5; 100 variables; 
100 iterations; size(LAB) = 10; size(POS) = size(UNL) = size(LAB_2) = M; 
M ranges from 10 to 1000 by step 10. These curves show that with only 10 labeled 
examples, the learning algorithm performs almost as well with M positive and 
M unlabeled examples as with M labeled examples. 



4 Theoretical Framework 

Let $C$ be a class of concepts over $X$. Suppose that $C$ is SQ learnable by some 
algorithm $L$. Let $f$ be the target concept and let us consider a statistical query 
$\chi$ made by $L$. The statistics oracle $STAT(f, D)$ returns an estimate of $D_\chi = 
D(\{x \mid \chi((x, f(x))) = 1\})$ within some given accuracy. We may write: 

$$D_\chi = D(\{x \mid \chi((x, 1)) = 1 \wedge f(x) = 1\}) + D(\{x \mid \chi((x, 0)) = 1 \wedge f(x) = 0\})$$
$$= D(\{x \mid \chi((x, 1)) = 1\} \cap f) + D(\{x \mid \chi((x, 0)) = 1\} \cap \bar{f}) = D(A \cap f) + D(B \cap \bar{f})$$

where the sets $A$ and $B$ are defined by: $A = \{x \mid \chi((x, 1)) = 1\}$, $B = \{x \mid \chi((x, 0)) = 1\}$. 
Furthermore, let $A$ be any subset of the instance space $X$ and $f$ be a concept over 
$X$, we have $D(A \cap f) = D_f(A) \times D(f)$ and $D(A \cap \bar{f}) = D(A) - D(A \cap f)$. From 
the preceding equations, we obtain that $D_\chi = D(f) \times (D_f(A) - D_f(B)) + D(B)$. 

Now, in order to estimate $D_\chi$, it is sufficient to estimate $D(f)$, $D_f(A)$, $D_f(B)$ 
and $D(B)$. If we get an estimate of $D(f)$ within accuracy $\alpha$ and estimates of 
$D_f(A)$, $D_f(B)$ and $D(B)$ within accuracy $\beta$, it can be easily shown that $D(f) \times 
(D_f(A) - D_f(B)) + D(B)$, computed from these estimates, is an estimate of $D_\chi$ within accuracy $\alpha + \beta(3 + 2\alpha)$. 
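
The accuracy $\alpha + \beta(3 + 2\alpha)$ follows from bounding the deviation by $\alpha |D_f(A) - D_f(B)| + 2\beta D(f) + 2\alpha\beta + \beta \leq \alpha + 3\beta + 2\alpha\beta$. The sketch below checks this numerically on one arbitrary set of probabilities; the values and function names are illustrative assumptions. Because the expression is affine in each perturbation, the worst case is reached at one of the sign patterns enumerated.

```python
import itertools

def combined(d_f, df_a, df_b, d_b):
    """D(f) * (D_f(A) - D_f(B)) + D(B)."""
    return d_f * (df_a - df_b) + d_b

def worst_case_error(d_f, df_a, df_b, d_b, alpha, beta):
    """Largest deviation over all sign patterns of the four perturbations."""
    exact = combined(d_f, df_a, df_b, d_b)
    worst = 0.0
    for s in itertools.product((-1, 1), repeat=4):
        approx = combined(d_f + s[0] * alpha, df_a + s[1] * beta,
                          df_b + s[2] * beta, d_b + s[3] * beta)
        worst = max(worst, abs(approx - exact))
    return worst

alpha, beta = 0.05, 0.01
worst = worst_case_error(0.5, 0.9, 0.1, 0.3, alpha, beta)
bound = alpha + beta * (3 + 2 * alpha)
```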

As usual, $D(f)$ can be estimated using the oracle $EX(f, D)$. We can estimate 
$D_f(A)$ and $D_f(B)$ using the POS oracle $EX(f, D_f)$. We can estimate $D(B)$ 
with the UNL oracle $EX(1, D)$. So we can modify any statistical query based 
algorithm so that it uses the EX, POS and UNL oracles. Furthermore, if the 
standard algorithm makes $N$ queries, labeled, positive and unlabeled example 
sources will be used to estimate respectively 1, $2N$ and $N$ queries. 

In this paper, we make the assumption that labeled examples are “expensive” 
and that unlabeled and positive examples are “cheap”. If we make the stronger 
assumption that positive and unlabeled data are free, we can estimate $D_f(A)$, 
$D_f(B)$ and $D(B)$ within arbitrary accuracy, i.e. $\beta = 0$. If $\tau_{min}$ is the smallest 
tolerance needed by the learning algorithm $L$, then whatever the number of 
queries made by $L$, we need labeled examples to estimate only 
one probability, namely $D(f)$, within accuracy $\tau_{min}$. Let $Q$ be the query space used 
by $L$. Theorem 1 gives an upper bound on the number of calls of $EX(f, D)$ 
necessary to simulate the statistical queries needed by $L$. We see that we can 
expect to divide this number of calls by $VCDIM(Q)$. 

A more precise theoretical study remains to be done. For instance, it would 
be interesting to estimate the expected improvement in accuracy, for a fixed 
number of labeled examples, as a function of the number of positive and 
unlabeled examples. This could be done for each usual statistical query learning 
algorithm. 



5 Tree Induction from Labeled, Positive, and Unlabeled 
Data 

C4.5, and more generally decision tree based learning algorithms, are SQ-like 
algorithms because attribute test choices depend on statistical queries. After 
a brief presentation of C4.5, and since C4.5 is an SQ algorithm, we describe in 
Sect. 5.2 how to adapt it for the treatment of positive and unlabeled data. We 
finally discuss experimental results of a modified version of C4.5 which uses 
positive and unlabeled examples. 

5.1 C4.5, a Top-Down Decision Tree Algorithm 

Most algorithms for tree induction use a top-down, greedy search through the 
space of decision trees. The splitting criterion used by C4.5 [Qui93] is based on 
a statistical property, called information gain, itself based on a measure from 
information theory, called entropy. Given a sample $S$ of some target concept, 
the entropy of $S$ is $Entropy(S) = \sum_{i} -p_i \log_2 p_i$ where $p_i$ is the proportion 
of examples in $S$ belonging to the class $i$. The information gain is the expected 
reduction in entropy obtained by partitioning the sample according to an attribute test $t$. 
It is defined as 

$$Gain(S, t) = Entropy(S) - \sum_{v \in Values(t)} \frac{N_v}{N} Entropy(S_v) \qquad (2)$$

where $Values(t)$ is the set of every possible value for the attribute test $t$, $N_v$ is 
the cardinality of the set $S_v$ of examples in $S$ for which $t$ has value $v$ and $N$ is 
the cardinality of $S$. 
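
Entropy and information gain as defined in (2) can be computed with a few lines of Python. This is a generic sketch, not the C4.5 code: the function names and the representation of an attribute test as a Python function are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, test):
    """examples: list of (x, label); test: function mapping x to a value v."""
    labels = [label for _, label in examples]
    groups = {}
    for x, label in examples:
        groups.setdefault(test(x), []).append(label)   # the subsets S_v
    remainder = sum(len(g) / len(examples) * entropy(g)
                    for g in groups.values())
    return entropy(labels) - remainder

examples = [((0,), 0), ((0,), 0), ((1,), 1), ((1,), 1)]
gain = information_gain(examples, lambda x: x[0])
```

In this toy sample the attribute separates the two classes perfectly, so the gain equals the full entropy of the sample.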

5.2 C4.5 with Positive and Unlabeled Data 

Let $X$ be the instance space. We only consider binary classification problems. 
The classes are denoted by 0 and 1, and an example is said to be positive if its 
label is 1. Let POS be a sample of positive examples of some target concept $f$, 
let LAB be a sample of labeled examples and let UNL be a set of unlabeled 
data. Let $D$ be the hidden distribution which is defined over $X$. POS is a set 
of examples $(x, f(x) = 1)$ returned by an example oracle $EX(f, D_f)$, LAB is a 
set of examples $(x, f(x))$ returned by an example oracle $EX(f, D)$ and UNL is 
a set of instances $x$ drawn according to the distribution $D$. The entropy of a 
sample $S$ is defined by $Entropy(S) = -p_0 \log_2 p_0 - p_1 \log_2 p_1$. In this formula, $S$ 
is the set of training examples associated with the current node $n$ and $p_1$ is the 
proportion of positive examples in $S$. Let $D_n$ be the filtered distribution, that 
is the hidden distribution $D$ restricted to instances reaching the node $n$, and let $X_n$ 
be the set of instances reaching the node $n$: $p_1$ is an estimate of $D_n(f)$. Now, 
in the light of the results of Sect. 4, we modify formulas for the calculation of 
the information gain. We have $D_n(f) = D(X_n \cap f) / D(X_n)$. Using the equation 
$D(X_n \cap f) = D_f(X_n) \times D(f)$, we obtain $D_n(f) = D_f(X_n) \times D(f) \times 1/D(X_n)$. 

We can estimate $D_f(X_n)$ using the set of positive examples associated with 
the node $n$, we can estimate $D(f)$ with the complete set of labeled examples, 
and we can estimate $D(X_n)$ with unlabeled examples. More precisely, let $POS^n$ 
be the set of positive examples associated with the node $n$, let $UNL^n$ be the set 
of unlabeled examples associated with the node $n$, and let $LAB_1$ be the set of 
positive examples in the set of labeled examples LAB. The entropy of the node 
$n$ is calculated using the following: 

$$p_1 = \inf\left(\frac{|POS^n|}{|POS|} \times \frac{|LAB_1|}{|LAB|} \times \frac{|UNL|}{|UNL^n|},\ 1\right); \quad p_0 = 1 - p_1 \qquad (3)$$



The reader should note that $|LAB_1|/|LAB|$ is independent of the node $n$. We now define 
the information gain of the node $n$ by $Gain(n, t) = Entropy(n) - \sum_{v \in Values(t)} 
(N_v^n / N^n) Entropy(n_v)$ where $Values(t)$ is the set of every possible value for the 
attribute test $t$, $N^n$ is the cardinality of $UNL^n$, $N_v^n$ is the cardinality of the set 
$UNL_v^n$ of examples in $UNL^n$ for which $t$ has value $v$, and $n_v$ is the node below 
$n$ corresponding to the value $v$ for the attribute test $t$. 
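
The node-level estimate of equation (3) and the resulting entropy can be sketched as below. Only the set cardinalities are needed; the function names and the example counts are illustrative assumptions, and this is not the code of the modified C4.5.

```python
import math

def p1_node(pos_n, pos_total, lab_pos, lab_total, unl_n, unl_total):
    """Estimate of D_n(f) following (3), clipped so that it never exceeds 1."""
    if unl_n == 0 or pos_total == 0 or lab_total == 0:
        raise ValueError("empty sample")
    ratio = (pos_n / pos_total) * (lab_pos / lab_total) * (unl_total / unl_n)
    return min(ratio, 1.0)

def node_entropy(p1):
    """Two-class entropy -p0 log2 p0 - p1 log2 p1, with 0 log 0 taken as 0."""
    return -sum(p * math.log2(p) for p in (p1, 1 - p1) if p > 0)

# 30 of 100 positive and 40 of 100 unlabeled examples reach the node;
# 5 of the 10 labeled examples are positive.
p1 = p1_node(30, 100, 5, 10, 40, 100)
h = node_entropy(p1)
clipped = p1_node(90, 100, 9, 10, 50, 100)   # raw product is 1.62
```

The clipping matters in practice because the three ratios come from independent samples, so their product can exceed 1 at small nodes.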



5.3 Experimental Results 

We applied the results of the previous section to C4.5 and called the resulting 
algorithm C4.5PosUnl. The differences as compared with C4.5 are the following: 

— C4.5PosUnl takes as input three sets: LAB, POS and UNL 

— $|LAB_1|/|LAB|$, which appears in (3), is calculated only once 

— For the current node, entropy and gain are calculated using (3) 

— When gain ratio is used, split information is calculated with unlabeled 
examples 

— The majority class is chosen using (3) 

— Halting criteria during the top-down tree generation are evaluated on UNL 

— When pruning the tree, classification errors are estimated with the help of 
proportions $p_0$ and $p_1$ from (3) 

We consider two data sets from the UCI Machine Learning Database [MM98]: 
kr-vs-kp and adult. The majority class is chosen as positive. We fix the sizes 
of the test set, set of positive examples, and set of unlabeled examples. These 
values are set to: 

— For kr-vs-kp: 1000 for the test set, 600 for the set of positive examples, and 
600 for the set of unlabeled examples 

— For adult: 15000 for the test set, 10000 for the set of positive examples, and 
10000 for the set of unlabeled examples 

We let the number of labeled examples vary, and compare the error rates of C4.5 
and C4.5PosUnl. For a given size of LAB, we iterate 100 times the following: all 
sets are selected randomly (for POS, a larger set is drawn and only the selected 
number of positive examples are kept), we compute the error rate for C4.5 with 
input LAB and the error rate for C4.5PosUnl with input LAB, POS and UNL. 
Then, we average the error rates over the 100 experiments. 

The results can be seen in Fig. 2. The error rates are promising when the 
number of labeled examples is small (e.g. less than 100). We think that the better 
results of C4.5 for higher numbers of examples are due to our pruning algorithm, 
which does not make the best use of positive and unlabeled examples (C4.5PosUnl 
trees are consistently larger than C4.5 ones). 




Fig. 2. error rate of C4.5 and C4.5PosUnl averaged over 100 trials. The left 
figure shows the results on the kr-vs-kp data set and the right one corresponds 
to the adult data set. 



adult and kr-vs-kp were selected in this paper because they are well known 
and contain many examples. The experiments were run on all other two-class UCI 
problems (ftp://grappa.univ-lille3.fr/pub/Experiments/C45PosUnl). 

6 Conclusion 

In many practical learning situations, labeled data are rare or expensive to collect 
while a great number of positive and unlabeled data are available. In this paper, 
we have given experimental and theoretical evidence that these kinds of examples 
can efficiently be used to boost statistical query learning algorithms. A lot of 
work remains to be done, in several directions: 

— More precise theoretical results must be stated, at least for specific statistical 
query learning algorithms 

— C4.5 should be modified further, especially the pruning algorithm which 
must be adapted to the data types presented in this paper 

— We intend to collect real data of the kind studied here (labeled, positive and 
unlabeled) to test this new variant of C4.5 

— Our method can be applied to any statistical query algorithm. It would be 
interesting to know if it can be appropriate elsewhere 
Abstract. We provide an algorithm to PAC learn multivariate polynomials 
with real coefficients. The instance space from which labeled 
samples are drawn is $\mathbb{R}^N$ but the coordinates of such samples are known 
only approximately. The algorithm is iterative and the main ingredient 
of its complexity, the number of iterations it performs, is estimated 
using the condition number of a linear programming problem associated 
to the sample. To the best of our knowledge, this is the first study of 
PAC learning concepts parameterized by real numbers from approximate 
data. 



1 Introduction 



In the PAC model of learning one often finds concepts parameterized by real 
numbers. Examples of such concepts appear in the first pages of well-known 
textbooks such as (Z|- The algorithmics for learning such concepts follows the 
same pattern as that for learning concepts parameterized by Boolean values. 
One randomly selects a number of elements $x_1, \ldots, x_m$ in the instance space $X$. 
Then, with the help of an oracle, one decides which of them satisfy the target 
concept $c^*$. Finally, one computes a hypothesis $c^h$ which is consistent with the 
sample, i.e. a concept which is satisfied by exactly those $x_i$ which satisfy $c^*$. 

A main result from Blumer et al. P| provides a bound for the size m of the 
sample above in order to guarantee that the error of the hypothesis $c^h$ is less than $\epsilon$ with 
probability at least $1 - \delta$, namely 

$$m \geq C_0 \left( \frac{1}{\epsilon} \log \frac{1}{\delta} + \frac{VCdim(C)}{\epsilon} \log \frac{1}{\epsilon} \right) \qquad (1)$$

where $C_0$ is a universal constant and VCdim(C) is the Vapnik-Chervonenkis 
dimension of the concept class at hand. This result is especially useful when 
concepts are not discrete entities (i.e. not representable using words over a finite 
alphabet), since in the discrete case one can bound the size of the sample without using 
the VC dimension. 

A particularly important case of concepts parameterized by real numbers is 
the one in which the membership test of an instance $x \in X$ to a concept $c$ in the 
concept class $C$ can be expressed by a quantifier-free first-order formula. In this 
case, concepts in $C_{n,N}$ are parameterized by elements in $\mathbb{R}^n$, the instance space 
$X$ is the Euclidean space $\mathbb{R}^N$, and the membership of $x \in X$ to $c \in C$ is given by 
the truth of $\Psi_{n,N}(x, c)$ where $\Psi_{n,N}$ is a quantifier-free first-order formula of the 
theory of the reals with $n + N$ free variables. In this case, a result of Goldberg 
and Jerrum 0 bounds the VC-dimension of Cn.N by 

$$VCdim(C_{n,N}) \leq 2n \log(8eds). \qquad (2)$$

Here $d$ is a bound for the degrees of the polynomials appearing in $\Psi_{n,N}$ and $s$ is 
a bound for the number of distinct atomic predicates in $\Psi_{n,N}$. 

One may say that, at this stage, the problem of PAC learning a concept $c^* \in 
C_{n,N}$ is solved. Given $\epsilon, \delta > 0$ we simply compute $m$ satisfying (1) with the 
VC-dimension replaced by the bound in (2). Then we randomly draw $x_1, \ldots, x_m \in X$ 
and finally we compute a hypothesis $c^h \in C_{n,N}$ consistent with the membership 
of $x_i$ to $c^*$, $i = 1, \ldots, m$ (which we obtain from some oracle). To obtain $c^h$ we 
may use any of the algorithms proposed recently to solve the first-order theory 
of the reals (cf. 

It is however at this stage that our research has its starting point, by 
remarking that, from a practical viewpoint, we cannot exactly read the elements 
$x_i$. Instead, we obtain rational approximations $\tilde{x}_i$. As an example, imagine that 
we want to learn how to classify some kind of stones according to whether 
a stone satisfies a certain concept $c^*$ or not. For each stone we measure $N$ 
parameters (e.g. radioactivity, weight, etc.) and we have access to a collection of 
such stones already classified. That is, for each stone in the collection, we know 
whether the stone satisfies $c^*$. When we measure one of the parameters $x_i$ of a 
stone, say the weight, we don’t obtain the exact weight but an approximation 
$\tilde{x}_i$. The membership of this stone to $c^*$ depends nevertheless on $x = (x_1, \ldots, x_N)$ 
and not on $\tilde{x}$. Our problem thus becomes that of learning $c^*$ from approximate 
data. A key feature is that we know the precision $\rho$ of these approximations and 
that we can actually modify $\rho$ in our algorithm to obtain better approximations. 
In our example this corresponds to fixing the number of digits appearing on the 
display of our measuring instrument. 

In this paper we give an algorithm to learn from approximate data for a 
particular learning problem, namely PAC learning the coefficients of a multivariate 
polynomial from the signs ($\geq 0$ or $< 0$) the polynomial takes over a sample of 
points. While there are several papers dealing with this problem (e.g. [mEl) 
they either consider Boolean variables, i.e. $X = \{0, 1\}^N$, or they work over finite 
fields. To the best of our knowledge, the consideration of rounded-off real data 
is new. 

In studying the complexity of our algorithm we will naturally deal with a 
classical theme in numerical analysis, that of conditioning, and we will find 
the common dependence of running time on the condition number of the input 
(cf. 0). 
2 The Problem 

Consider the class $\mathcal{P}_{d,N}$ of polynomials of degree $d$ in $N$ variables which 
have a certain fixed monomial structure. That is, fix a subset $\mathcal{O} \subset \mathbb{N}^N$ such that 
for all $\alpha = (\alpha_1, \ldots, \alpha_N) \in \mathcal{O}$, $\alpha_1 + \cdots + \alpha_N \leq d$. Thus, the elements in $\mathcal{P}_{d,N}$ 
have the form 

$$f = \sum_{\alpha \in \mathcal{O}} c_\alpha y^\alpha$$

with $c_\alpha \in \mathbb{R}$, and $y^\alpha = y_1^{\alpha_1} \cdots y_N^{\alpha_N}$. Let $n$ be the cardinality of $\mathcal{O}$. We will denote 
by $c$ the vector of coefficients of $f$ and assume an ordering of the elements in $\mathcal{O}$ 
so that $c = (c_1, \ldots, c_n)$. Also, to emphasize the dependence of $f$ on its coefficient 
vector we will write the polynomial above as $f_c$. 

Our goal is to PAC learn a target polynomial $f_{c^*}$ with coefficients $c^*$. The 
instance space is $\mathbb{R}^N$ and we assume a probability distribution $\mathcal{D}$ over it. For an 
instance $y \in \mathbb{R}^N$, we say that $y$ satisfies $c^*$ when $f_{c^*}(y) \geq 0$. This makes $\mathcal{P}_{d,N}$ 
into a concept class by associating to each $f \in \mathcal{P}_{d,N}$ the concept set $\{y \in \mathbb{R}^N \mid 
f(y) \geq 0\}$. 

The error of a hypothesis $c^h$ is given by 

$$Error(c^h) = Prob\left(sign(f_{c^*}(y)) \neq sign(f_{c^h}(y))\right)$$

where the probability is taken according to $\mathcal{D}$ and the sign function is defined 
by 

$$sign(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{otherwise} \end{cases}$$

As usual, we will suppose that an oracle $EX_{c^*} : \mathbb{R}^N \to \{0, 1\}$ is available, computing 
$EX_{c^*}(y) = sign(f_{c^*}(y))$. We finally recall that a randomized algorithm PAC 
learns $f_{c^*}$ with error $\epsilon$ and confidence $\delta$ when it returns a concept $c^h$ satisfying 
$Error(c^h) \leq \epsilon$ with probability at least $1 - \delta$. 

Should we be able to deal with arbitrary real numbers, the following 
algorithm would PAC learn $f_{c^*}$. 

Algorithm 1 

Input: $N$, $d$, $\epsilon$ and $\delta$ 

1. Compute $m$ using (1) and (2) 

2. Draw $m$ random points $y^{(i)} \in \mathbb{R}^N$ 

3. Use the function $EX_{c^*}$ to obtain $sign(f_{c^*}(y^{(i)}))$ for $i = 1, \ldots, m$ 

4. From step 3, we obtain a number of linear inequalities in $c$ 
and these inequalities can be written in matrix form 

$$B_1 c < 0, \quad B_2 c \leq 0$$

5. Find any vector $c^h$ satisfying the system in step 4 

6. Output: $c^h$ 

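
Steps 3 and 4 can be illustrated with a small sketch that builds the rows of the linear system from sampled points and their signs. It assumes the convention that a point is positive when $f_c(y) \geq 0$, so negative points give strict inequalities and positive points give non-strict ones after negation. The function names and the toy polynomial $-1 + y_1 + y_2$ are hypothetical.

```python
def monomials(y, exponents):
    """Evaluate y^alpha for each exponent tuple alpha of the fixed structure."""
    row = []
    for alpha in exponents:
        value = 1.0
        for y_j, a_j in zip(y, alpha):
            value *= y_j ** a_j
        row.append(value)
    return row

def build_system(points, signs, exponents):
    """Return (B1, B2) with B1 c < 0 for negative and B2 c <= 0 for positive points."""
    b1, b2 = [], []
    for y, s in zip(points, signs):
        row = monomials(y, exponents)
        if s == 0:
            b1.append(row)                   # f_c(y) < 0
        else:
            b2.append([-v for v in row])     # f_c(y) >= 0, i.e. -row . c <= 0
    return b1, b2

def dot(row, c):
    return sum(r * c_k for r, c_k in zip(row, c))

exponents = [(0, 0), (1, 0), (0, 1)]         # monomials 1, y_1, y_2
c_star = (-1.0, 1.0, 1.0)
points = [(0.0, 0.0), (2.0, 0.0), (0.5, 0.5)]
signs = [1 if dot(monomials(y, exponents), c_star) >= 0 else 0 for y in points]
b1, b2 = build_system(points, signs, exponents)
```

The target coefficient vector satisfies every row by construction, which is why the feasible set in step 5 is never empty when the data are exact.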
Note that, to execute step 5, we don’t need the general algorithms for solving 
the first-order theory over the reals mentioned in the preceding section but only 
an algorithm to find a feasible point of a linear programming instance whose 
feasible set is non-empty (since c* belongs to it). 

If real data can not be dealt with exactly, we need to proceed differently. We 
begin to do so in the next section, by discussing our model of round-off. 



3 Round-Off and Errors 

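Let $y \in \mathbb{R}^N$. We say that $\tilde{y} \in \mathbb{Q}^N$ approximates $y$ with precision $\rho$, $0 < \rho < 1$ 
(or that $\tilde{y}$ is a measure of $y$ with such precision) when 

$$|\tilde{y}_j - y_j| \leq \rho\, |y_j| \quad \text{for } j = 1, \ldots, N$$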



Remark. The definition above is the usual definition of relative precision found in numerical analysis. Numbers here are represented in the form

$$z = \pm a \times 10^{e}$$

where $e \in \mathbb{Z}$, $a = 0$ or $a \in [1, 10)$, and $a$ is written with $\lceil |\log_{10} \rho| \rceil$ digits. The number $a$ is called the mantissa of $z$ and the number $e$ its exponent. For instance

$$3.14159 \times 10^{0}$$

approximates $\pi$ with precision $10^{-6}$.

There is a strong correlation between $\rho$ and the number of digits (or bits) necessary to write down $\tilde{y}_j$. However, this correlation does not translate into a definite relation, since the magnitude of $\tilde{y}_j$ (among other things) also contributes to determining this number of digits. For instance, the number $1.23456 \times 10^{12}$ will use 13 digits when written down. On the other hand, $4.00000 \times 10^{0}$ only needs one digit. And both have precision $10^{-6}$.
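The remark can be checked numerically. The sketch below rounds a number to a fixed number of significant decimal digits and reports both the relative error and the number of digits needed to write the value out in full. The helper names are ours, and rounding to the nearest mantissa is an assumption; the text does not fix a rounding rule.

```python
import math


def measure(y, digits):
    """Round y to `digits` significant decimal digits (mantissa in [1, 10))."""
    if y == 0:
        return 0.0
    return float(f"{y:.{digits - 1}e}")


def relative_error(y, y_tilde):
    """|y_tilde - y| / |y|, the quantity bounded by the precision."""
    return abs(y_tilde - y) / abs(y)


def digits_written(z):
    """Number of decimal digits in the integer part of z written out in full."""
    return len(str(abs(int(z))))


pi_measured = measure(math.pi, 6)                 # six-digit mantissa
pi_error = relative_error(math.pi, pi_measured)   # below 1e-6
```

Here `digits_written(1.23456e12)` and `digits_written(4.00000e0)` differ widely although both numbers carry the same six-digit mantissa, which is the point of the remark.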



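In our learning problem, we fix a precision $\rho$ and we measure each instance $y^{(i)}$ in our sample to obtain $\tilde{y}^{(i)}$. Consequently, when we compute $B_1$ and $B_2$ we do not obtain those matrices but rather some approximations $\tilde{B}_1$ and $\tilde{B}_2$. Our first result bounds the relative error (the precision) for the entries of $\tilde{B}_1$ and $\tilde{B}_2$.

Theorem 1. Let $b$ be any entry of $B_1$ or $B_2$ and $\tilde{b}$ the corresponding entry of $\tilde{B}_1$ or $\tilde{B}_2$. If $|\tilde{y}_j^{(i)} - y_j^{(i)}| \leq \rho\,|y_j^{(i)}|$ for $i = 1, \ldots, m$ and $j = 1, \ldots, N$ then

$$\frac{|\tilde{b} - b|}{|b|} \leq \sigma = (1 + \rho)^{d} - 1.$$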



□

A crucial remark at this stage is that if the feasible set of the system

$$B_1 c \leq 0, \qquad B_2 c < 0$$

has empty interior then, no matter how small $\rho$ is, the system

$$\tilde{B}_1 c \leq 0, \qquad \tilde{B}_2 c < 0$$

may have no solutions at all. Therefore, in the sequel, we will search only for interior solutions of the system. That is, we will search for solutions of the system

(LP1) $Bc < 0$ with $B = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}$.

Our problem remains to find a solution of $Bc < 0$ knowing only $\tilde{B}$. In the next section we will give a first step in this direction.

4 Narrowing the Feasible Set 



Consider the system

(LP2) $\tilde{B}c < 0$.

Let $A = (B, -B)$ and

(LP3) $Ax < 0$

with $x \in \mathbb{R}^{2n}$. Similarly, let $\tilde{A} = (\tilde{B}, -\tilde{B})$ and

(LP4) $\tilde{A}x < 0$.

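The following lemma is immediate.

Lemma 2. For $i = 1, \ldots, m$ and $j = 1, \ldots, 2n$,

$$\frac{|\tilde{a}_{ij} - a_{ij}|}{|a_{ij}|} \leq \sigma = (1 + \rho)^{d} - 1.$$

Define $\hat{A}$ as follows. For $i = 1, \ldots, m$ and $j = 1, \ldots, 2n$ let

$$\hat{a}_{ij} = \begin{cases} \dfrac{\tilde{a}_{ij}}{1 - \sigma} & \text{if } \tilde{a}_{ij} \geq 0 \\[2mm] \dfrac{\tilde{a}_{ij}}{1 + \sigma} & \text{if } \tilde{a}_{ij} < 0. \end{cases}$$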



Now consider the system

(LP5) $\hat{A}x < 0$, $x \geq 0$.

Theorem 3. If $x$ is a solution of (LP5) and $x = (u, v)$ with $u, v \in \mathbb{R}^n$ then $c = u - v$ is a solution of (LP1). □
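The mechanism behind Theorem 3 can be illustrated with a small sketch. It assumes the reading of the construction in which each measured entry is divided by $1 - \sigma$ when it is nonnegative and by $1 + \sigma$ when it is negative, so that every entry of the resulting matrix is an upper bound for the corresponding unknown entry of $A$. For $x \geq 0$ this gives $Ax \leq \hat{A}x$ componentwise. The function names and the small matrices used to exercise them are hypothetical.

```python
def split_matrix(B):
    """A = (B, -B): a vector c is then represented as u - v with u, v >= 0."""
    return [list(row) + [-b for b in row] for row in B]


def widen(A_tilde, sigma):
    """Entrywise upper bound on the exact matrix, given relative error sigma < 1."""
    return [
        [a / (1.0 - sigma) if a >= 0 else a / (1.0 + sigma) for a in row]
        for row in A_tilde
    ]


def matvec(M, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def recover_c(x):
    """Split x = (u, v) and return c = u - v."""
    n = len(x) // 2
    return [u - v for u, v in zip(x[:n], x[n:])]
```

If `x` is nonnegative and `matvec(widen(A_tilde, sigma), x)` is strictly negative in every row, then `recover_c(x)` satisfies the exact system, provided the measured entries really are within relative error `sigma` of the exact ones.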

Theorem 3 inspires the following algorithm.

Algorithm 2

Input: $N$, $d$, $\varepsilon$ and $\delta$

1. Compute $m$ using (P) and (0)

2. Get $m$ random points $y^{(i)}$

3. Use the function $EX_{c^*}$ to obtain $\mathrm{sign}(f_{c^*}(y^{(i)}))$ for $i = 1, \ldots, m$

4. $\rho := (3/2)^{1/d} - 1$

5. Measure $y^{(i)}$ with precision $\rho$ to obtain $\tilde{y}^{(i)}$, $i = 1, \ldots, m$

6. Write down the system (LP2), i.e., $\tilde{B}c < 0$

7. Transform system (LP2) to (LP4) and then to (LP5),
as described above

8. If there is any vector $x^h = (u^h, v^h)$ satisfying (LP5)

      return $c^h := u^h - v^h$ and HALT

   else

      $\rho := \rho^2$

      go to step 5



Remark. The initial value $\rho_0$ for the precision is set to $(3/2)^{1/d} - 1$ since this implies $\sigma = 1/2$. We actually can take for $\rho_0$ the largest power of 2 smaller than $(3/2)^{1/d} - 1$.

At each iteration of the algorithm $\rho$ is squared. This corresponds to doubling the number of bits of the mantissas in the measures $\tilde{y}_j^{(i)}$.
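A short numerical sketch of this remark follows. It computes the initial precision, checks that it yields $\sigma = 1/2$ under the relation $\sigma = (1 + \rho)^d - 1$, and lists the precisions produced by repeated squaring. The function names are ours.

```python
import math


def initial_precision(d):
    """rho_0 = (3/2)^(1/d) - 1, the starting precision of the iteration."""
    return 1.5 ** (1.0 / d) - 1.0


def sigma(rho, d):
    """Relative error bound (1 + rho)^d - 1 for products of up to d measured factors."""
    return (1.0 + rho) ** d - 1.0


def precision_schedule(d, iterations):
    """Precisions used in successive iterations: rho_0, rho_0^2, rho_0^4, ..."""
    rho = initial_precision(d)
    schedule = []
    for _ in range(iterations):
        schedule.append(rho)
        rho = rho * rho
    return schedule


def mantissa_bits(rho):
    """Approximate number of mantissa bits corresponding to precision rho."""
    return -math.log2(rho)
```

Since squaring $\rho$ doubles $-\log_2 \rho$, the value of `mantissa_bits` doubles from one entry of the schedule to the next.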

Before stepping into the analysis of Algorithm 2, we derive an upper bound for the relative error of $\hat{A}$ as an approximation of $A$. In the next statement, and in the rest of this paper, $\|\cdot\|$ denotes the 2-norm in Euclidean space.

Proposition 4. If $\rho \leq (3/2)^{1/d} - 1$ then for $i = 1, \ldots, m$

$$\frac{\|\hat{a}_i - a_i\|}{\|a_i\|} \leq 4\sigma.$$

□



5 Complexity and Condition 

Algorithm 2 can be implemented on a Turing machine (modulo the oracle $EX_{c^*}$). Its running time is determined by two quantities:

1) the number of iterations performed by the algorithm, and

2) the bit-size of the rational numbers involved in the intermediate computations.

Notice that the cost of each iteration is dominated by step 8 and, more precisely, by checking the existence of a solution of (LP5). This is a linear programming problem over the rationals and, as such, it can be solved in polynomial time by either the ellipsoid method or the interior point method (see, e.g., EDI).

These methods work in time bounded by a polynomial of low degree in the dimension of the input system and linear in the largest bit-size of its entries. In our problem, the dimension of the input system is fixed ($m \times 2n$) through all the iterations and is itself polynomial in $n$, $d$, $\varepsilon$ and $|\log \delta|$.

The bit-size of the entries of (LP5) presents a more complicated issue. It depends, on the one hand, on the largest and smallest (in absolute value) quantities among the $\tilde{y}_j^{(i)}$ and, on the other hand, on the precision $\rho$ with which these quantities are measured. The first number,

$$L = \max_{i \leq m,\; j \leq 2n,\; y_j^{(i)} \neq 0} [\ldots],$$

is not controlled by Algorithm 2. It is actually a random variable dependent on the distribution $\mathcal{D}$. The second number, the precision $\rho$, affects the bit-size of $\tilde{y}_j^{(i)}$ as observed in the Remark of Section 3.

In the rest of this paper we will focus on estimating the number of iterations 
performed by Algorithm 2. In doing so, it will be necessary to traverse the 
territory of linear programming and numerical analysis. 

Let $b_i \in \mathbb{R}^n$ be the $i$th row of $B$ and $b_i^{\perp}$ be the hyperplane perpendicular to $b_i$ and passing through the origin. Note that $b_i^{\perp}$ is the boundary of the half-space defined by the inequality $b_i c < 0$.



Definition 5. For every $c \in \mathbb{R}^n$ let $\theta_i(B, c)$ be the acute angle, i.e. $0 \leq \theta_i(B, c) \leq \pi/2$, between $c$ and the hyperplane $b_i^{\perp}$. Also, let

$$\theta(B, c) = \min_{i \leq m} \theta_i(B, c).$$

Finally, let the condition number of $B$ be

$$C(B) = \min_{c \in \mathrm{Sol}(B)} \frac{1}{\sin \theta(B, c)}.$$

Here $\mathrm{Sol}(B)$ denotes the set of points $c \in \mathbb{R}^n$ such that $Bc < 0$. We will denote by $\bar{c}$ any point in $\mathrm{Sol}(B)$ for which this minimum is attained.
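To make the definition concrete, the sketch below evaluates $1/\sin\theta(B, c)$ for a given feasible point $c$, using the standard identity $\sin\theta_i(B, c) = |b_i \cdot c| / (\|b_i\|\,\|c\|)$ for the angle between a vector and a hyperplane through the origin. Because $C(B)$ is a minimum over all of $\mathrm{Sol}(B)$, the value returned for a particular $c$ is only an upper bound on $C(B)$; finding the minimizer is not attempted here. The function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sin_theta(B, c):
    """sin of the smallest angle between c and the hyperplanes orthogonal to the rows of B."""
    norm_c = math.sqrt(dot(c, c))
    return min(abs(dot(b, c)) / (math.sqrt(dot(b, b)) * norm_c) for b in B)


def condition_upper_bound(B, c):
    """1 / sin(theta(B, c)) for a point c with Bc < 0; an upper bound on C(B)."""
    if any(dot(b, c) >= 0 for b in B):
        raise ValueError("c is not in Sol(B)")
    return 1.0 / sin_theta(B, c)
```

For the system whose solution set is the open positive quadrant of the plane, the diagonal direction gives the value $\sqrt{2}$, while directions closer to either axis give larger values.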



Remark. Note that $\bar{c}$ actually maximizes $\theta(B, c)$. Also, for $c \in \mathrm{Sol}(B)$, let $d_i$ be the distance between $c$ and $b_i^{\perp}$. Then $d_i = \|c\| \sin \theta_i(B, c)$. So, we can rewrite $C(B)$ as

$$C(B) = \min_{c \in \mathrm{Sol}(B)} \frac{\|c\|}{\min_{i \leq m} d_i}.$$



The expression $\min_{i \leq m} d_i / \|c\|$ can be seen as the (normalized) distance from $c$ to the boundary of $\mathrm{Sol}(B)$. So, $\bar{c}$ is a solution of (LP1) having a maximal distance to this boundary and $C(B)$ is the inverse of this distance.

Intuitively, if $C(B)$ is small (i.e. if $\theta(B, \bar{c})$ is large) a greater error can be allowed in the coefficients of $B$ and we may need fewer iterations in Algorithm 2. The following theorem, our main result, quantifies this fact.



Theorem 6. If $C(B) < \infty$ then the algorithm will halt and return a solution $c^h$. Furthermore, the number of iterations is bounded by the smallest integer greater than

$$\log_2\left(\frac{\log_2\left(\left(1 + \dfrac{1}{4\sqrt{2}\,C(B)}\right)^{1/d} - 1\right)}{\log_2(\rho_0)}\right)$$

where $\rho_0$ is the value of $\rho$ set in step 4.

□



Remark. Theorem 6 bounds the number of iterations in Algorithm 2 as a function of $C(B)$. We end this section by noting that $C(B)$ is actually a random variable since $B$ depends on the random sample $y^{(1)}, \ldots, y^{(m)}$. The number of iterations in Algorithm 2 is a random variable as well. Its expected value can be bounded by replacing $C(B)$ by its expected value in the bound of Theorem 6.

6 A Characterization of $C(B)$

Condition numbers are defined in numerical analysis mainly for continuous functions $\varphi : \mathbb{R}^n \to \mathbb{R}^m$. At a point $x \in \mathbb{R}^n$, the condition number $\mu(x)$ measures the largest possible value

$$\frac{\|\varphi(x + \Delta) - \varphi(x)\|}{\|\Delta\|}$$

over all infinitesimal perturbations $\Delta$ of the point $x$.

A recurrent theme is the relation between $\mu(x)$ and the distance from $x$ to the set $\Sigma$ of ill-posed inputs, i.e. to the set of points $x \in \mathbb{R}^n$ such that $\mu(x) = \infty$. For a number of problems, the condition number $\mu(x)$ is the inverse of the distance from $x$ to $\Sigma$ (often multiplied by $\|x\|$ to scale properly).

For computational problems which are not describable by a function $\varphi$ as above, the definition of condition number is less clear (we have been very sketchy here; for a more detailed discussion on condition numbers see BE]). For linear programming problems, several condition numbers were defined in the last few years (e.g. 0CS1). The one introduced by Renegar is defined precisely in terms of distance to ill-posedness.

Let $Bc < 0$ be a feasible system, i.e. $\mathrm{Sol}(B) \neq \emptyset$. Also, let

$$D(B) = \sup\{\Delta \mid \max_{i,j} |b_{ij} - b'_{ij}| < \Delta \Rightarrow \mathrm{Sol}(B') \neq \emptyset\}.$$

Renegar defines the condition number of $B$ to be

$$C_R(B) = \frac{\max_{i,j} |b_{ij}|}{D(B)}.$$

A variation of Renegar's condition number, also in the spirit of the inverse to the distance to infeasibility (but normalized differently), is the following. Let

$$D^*(B) = \sup\Bigl\{\Delta \Bigm| \max_{i \leq m} \frac{\|b_i - b'_i\|}{\|b_i\|} < \Delta \Rightarrow \mathrm{Sol}(B') \neq \emptyset\Bigr\}$$

and

$$C^*(B) = \frac{1}{D^*(B)}.$$

In this section, we will state some relationships between $C(B)$, $C^*(B)$ and $C_R(B)$.

Theorem 7. $C(B) = C^*(B)$. □

Proposition 8. $C^*(B) \leq \sqrt{n}\, C_R(B)$, where $C_R(B)$ denotes Renegar's condition number.

One can prove that an upper bound for $C_R(B)$ in terms of $C^*(B)$ with the format of Proposition 8, i.e. with the form $C_R(B) \leq f(n, m)\, C^*(B)$ for some function $f$ of the dimensions $n$ and $m$, cannot exist. A key difference between $C_R(B)$ and $C^*(B)$ is that, while both condition numbers are homogeneous of degree zero in $B$ (i.e. $C_R(\lambda B) = C_R(B)$ for all $\lambda > 0$), $C^*(B)$ is actually multi-homogeneous in its rows and $C_R(B)$ is sensitive to differences in row scaling. The next two propositions make this more precise.

Proposition 9. Let $\lambda_1, \ldots, \lambda_m > 0$ and $b'_i = \lambda_i b_i$ for $i = 1, \ldots, m$. Denote by $B'$ the matrix whose $i$th row is $b'_i$. Then $C^*(B) = C^*(B')$. □



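Proposition 10.

$$C^*(B) \geq \frac{\min_{i} \|b_i\|}{\sqrt{n}\, \max_{i,j} |b_{ij}|}\; C_R(B),$$

where $C_R(B)$ denotes Renegar's condition number.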

□

Abstract. Recently, Kearns and Singh presented the first provably efficient and near-optimal algorithm for reinforcement learning in general Markov decision processes. One of the key contributions of the algorithm is its explicit treatment of the exploration-exploitation trade-off. In this paper, we show how the algorithm can be improved by replacing the exploration phase, which builds a model of the underlying Markov decision process by estimating the transition probabilities, with an adaptive sampling method more suitable for the problem. Our improvement is twofold. First, our theoretical bound on the worst-case time needed to converge to an almost optimal policy is significantly smaller. Second, due to the adaptiveness of the sampling method we use, we discuss how our algorithm might perform better in practice than the previous one.



1 Introduction 

In reinforcement learning, an agent faces the problem of learning how to behave 
in an unknown dynamic environment in order to achieve a goal. Instead of re- 
ceiving examples as in the supervised learning model, the learning agent must 
discover by interaction with the environment how to behave to get the most 
reward 0. 

Reinforcement learning has been receiving increasing attention in the last few years from both machine learning practitioners and theoreticians. The formal modeling of the environment as a Markov decision process (see next section for a formal definition) makes it particularly suitable for obtaining theoretical results that might also be applicable to real-world problems. Recently, learning-theoretic results have been obtained, the $E^3$ algorithm of Kearns and Singh p| being one of the most relevant. The algorithm is the first reinforcement learning algorithm that provably achieves near-optimal performance in polynomial time for general Markov decision processes, in contrast with previous asymptotic results. One of the key contributions of the algorithm is an explicit treatment of the exploration versus exploitation dilemma inherent to any reinforcement learning problem. In other words, any reinforcement learning algorithm has to spend some time obtaining information about the environment, the exploration phase, and then use that information to discover an almost optimal policy, the exploitation phase. Obviously, an exploration phase that is too long might lead to a poor bound, while one that is too short might not allow the algorithm to find an almost optimal policy. The algorithm provides an explicit method for deciding when to switch between the two phases.

* Supported by the EU Science and Technology Fellowship Program (STF13) of the European Commission.

A close look at the analysis of the $E^3$ algorithm reveals that the factor that strongly dominates the time bound (a polynomial of degree 4 in all the relevant problem parameters) comes from the sampling process used by the authors to estimate the transition probabilities during the exploration phase. Not only is the bound too large for any practical purpose, it also fails to satisfy the following intuitively desirable property. Given a certain state $i$ in a Markov decision process, suppose that the transition from state $i$ to state $j$ executing action $a$ has a very high probability, for instance 0.9. Intuitively, most of the time when we apply action $a$ from state $i$ we will land on state $j$, and so we should be able to realize very quickly that the probability of that particular transition is large. On the other hand, suppose that another state $i'$ has transition probabilities uniformly distributed among all the reachable states. In this case, when applying action $a$ from $i'$ we will land in a different state almost every time. Intuitively, in this case, more experience will be required to obtain a good approximation of all these transition probabilities. However, the $E^3$ algorithm will execute action $a$ from both states, $i$ and $i'$, exactly the same number of times. In other words, the number of times that we need to execute an action from a certain state in the exploration phase is fixed in advance to a worst-case bound independent of the underlying Markov decision process.

The improvement proposed here solves the two problems just mentioned. We replace the static batch sampling method used in the original $E^3$ algorithm with a sequential sampling method adapted to this problem from the one proposed in 1 ^ 3 . This algorithm samples sequentially and has a stopping condition that depends on the current estimate. Thus, the amount of sampling needed to estimate a certain transition probability will depend adaptively on the unknown underlying value being estimated, satisfying the intuition outlined above. In other words, our version of $E^3$ will perform differently depending on what the underlying Markov decision process is like. That is, it will adapt to the situation at hand instead of always behaving as in the worst case.

Moreover, even in the worst case, we will still use fewer examples than the fixed worst-case bound provided in ^ . Due to the nature of our sampling method, we can obtain estimators that are multiplicatively close to the original probabilities instead of additively close as in the original $E^3$. This will allow us to modify the proof of correctness so that the amount of sampling required will be smaller, even in the worst case. Since the time spent estimating the underlying model dominates the overall time bound, this will lead us to reduce it from a polynomial of degree 4 to a polynomial of degree 2 in the worst case, a significant reduction.



Faster Near-Optimal Reinforcement Learning 243 

Adaptive sampling has been studied for a long time (see, for instance, the book by Wald HI3) and has also been used recently in the context of database query estimation 0 and knowledge discovery m-. Furthermore, adaptivity is a very desirable property for an algorithm that is expected to be used in practical applications. See the discussion about the relevance of adaptivity in the context of learning and discovery science in m-.

As noted by Kearns and Singh, a practical implementation based on their algorithmic ideas would enjoy performance on natural problems that is considerably better than what their bounds indicate. In fact, our improvement corroborates that intuition, since our modification keeps all the ideas specific to the reinforcement learning problem while replacing the sampling method, in an ad hoc manner, with one more appropriate for this problem. It is important to notice that our method, while being more efficient, also keeps the same theoretical guarantees of reliability as the one used by the original algorithm.

This paper is organized as follows. In Section 2 we provide the formal definitions related to reinforcement learning. Then, we move on to Section 3, where we review in detail the $E^3$ algorithm and its proof. In Section 4 we show how to modify the $E^3$ algorithm with the adaptive sampling method and sketch a proof of its correctness. We conclude in Section 5 by discussing our result and future related work.

2 Preliminaries and Definitions 

A Markov decision process (MDP) $M$ can be defined by a tuple $(S, A, T, R)$ where: $S$ consists of a set of states $S = \{1, \ldots, N\}$; $A$ is a set of actions $A = \{a_1, \ldots, a_k\}$; $T$ represents the set of transition probabilities for each state-action pair $(i, a)$, where $P^a_M(ij)$ specifies the probability of landing in state $j$ when executing action $a$ from state $i$ in $M$¹; $R$ represents the rewards $R : S \to [0, R_{\max}]$, where $R(i)$ is the reward obtained by the agent in state $i$.

A policy defines how the agent behaves on the Markov decision process. More formally, a policy $\pi$ in a Markov decision process $M$ over a set of states $\{1, \ldots, N\}$ with actions $\{a_1, \ldots, a_k\}$ is a mapping $\pi : \{1, \ldots, N\} \to \{a_1, \ldots, a_k\}$.
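These definitions translate directly into a small data structure. The following is a minimal sketch, not part of the original paper: the class name `MDP`, its field names and the `validate` helper are our own, and transition probabilities are stored sparsely as dictionaries.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MDP:
    states: List[int]                                      # S = {1, ..., N}
    actions: List[str]                                     # A = {a_1, ..., a_k}
    transitions: Dict[Tuple[int, str], Dict[int, float]]   # (i, a) -> {j: P(i, a, j)}
    rewards: Dict[int, float]                              # i -> R(i), with R(i) >= 0

    def validate(self, tol=1e-9):
        """Check that rewards are nonnegative and each transition row sums to 1."""
        for i in self.states:
            if self.rewards[i] < 0:
                raise ValueError(f"negative reward in state {i}")
            for a in self.actions:
                total = sum(self.transitions[(i, a)].values())
                if abs(total - 1.0) > tol:
                    raise ValueError(f"row ({i}, {a}) sums to {total}")
        return True


# A policy is simply a mapping from states to actions.
Policy = Dict[int, str]
```

A policy for a two-state process with actions `"stay"` and `"move"` would then be written as `{1: "move", 2: "stay"}`.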

Once a Markov decision process and a policy have been defined, we can discuss how good that policy is using the two standard asymptotic measures of the return of a policy: the expected average return and the expected discounted return. Since the goal of the $E^3$ algorithm, as well as of our improved version, is to obtain finite-time convergence results, in this draft we will talk only about the $T$-step expected average return; this can be translated to the expected discounted return through the horizon time $1/(1 - \gamma)$ or to the expected average return through the mixing time of the optimal policy, as described in |S| .

Let $M$ be a Markov decision process and let $\pi$ be a policy in $M$. The $T$-step (expected) average return from state $i$ is defined as

$$U^{\pi}_M(i, T) = \sum_{p} \Pr\nolimits^{\pi}_M[p]\, U_M(p),$$

where the sum is over all $T$-paths $p$ in $M$ that start at $i$, $\Pr^{\pi}_M[p]$ is the probability of crossing path $p$ in $M$ executing $\pi$, and $U_M(p)$ is the average return along path $p$, defined as $U_M(p) = \frac{1}{T}(R_{i_1} + \cdots + R_{i_T})$. Moreover, we define the optimal $T$-step average return from $i$ in $M$ by $U^{*}_M(i, T) = \max_{\pi} U^{\pi}_M(i, T)$.

¹ Note that $\sum_j P^a_M(ij) = 1$ for any pair $(i, a)$.
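The sum over paths need not be enumerated explicitly; it can be evaluated by a backward recursion over the horizon. The sketch below does this under the assumption that a $T$-path consists of $T$ states $i_1, \ldots, i_T$ with $i_1 = i$, so that $T - 1$ transitions are taken. The function name and the dictionary-based representation are ours.

```python
def t_step_average_return(transitions, rewards, policy, start, T):
    """Expected average reward over paths of T states starting at `start`.

    transitions: (state, action) -> {next_state: probability}
    rewards:     state -> reward
    policy:      state -> action
    """
    # value[i] = expected sum of rewards over paths of t states starting at i
    value = dict(rewards)  # t = 1: the path is the single state i
    for _ in range(T - 1):
        value = {
            i: rewards[i]
            + sum(p * value[j] for j, p in transitions[(i, policy[i])].items())
            for i in rewards
        }
    return value[start] / T
```

Each pass of the loop extends the paths by one state, so the cost is proportional to $T$ times the number of nonzero transition probabilities used by the policy.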

Thus, the goal in reinforcement learning will be the following. Given parameters $\varepsilon$ (the error), $\delta$ (the confidence) and $T$ (the time horizon in the case of discounted return and the mixing time of the optimal policy in the case of infinite-horizon average return), we would like to have a reinforcement learning algorithm that with probability larger than $1 - \delta$ obtains a policy whose expected return is $\varepsilon$-close to the optimal one.



3 The $E^3$ Algorithm



Before going into the details of our improved version, we need to review in some detail the $E^3$ algorithm and its proof of correctness and reliability.

The $E^3$ algorithm is what is usually called an indirect or model-based reinforcement learning algorithm. That is, it builds a partial model of the underlying Markov decision process and then attempts to compute an optimal policy from it. Thus, there are two clearly distinct phases, which we review in the following.

One phase consists of obtaining knowledge of the underlying Markov model from experience on it. The kind of experience that the algorithm has access to is the following: given that the agent is in a particular state at a particular time step, it chooses the action to perform, executes it and lands on a (possibly) different state from where a further experiment can be performed. Kearns and Singh called this process balanced wandering, meaning that upon arrival in a state, the algorithm tries the action that has been tried the fewest number of times. Since we assume that the world where the agent moves is modeled by an MDP, the state reached when choosing an action from a certain state is determined according to the transition probability distribution. This experience is gathered at each state the algorithm visits and used to build an approximate model of the transition probability distribution in the obvious way. This phase is known as the exploration phase.
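Balanced wandering, together with the "obvious" empirical estimate of the transition probabilities, can be sketched as follows. This is a simulation under our own assumptions: the true transition table is available to the simulator only, ties between equally used actions are broken by list order, and the function name is hypothetical.

```python
import random
from collections import defaultdict


def balanced_wandering(transitions, actions, start, steps, rng):
    """Simulate balanced wandering and return (tries, estimates).

    transitions: (state, action) -> {next_state: probability}, used only to simulate
    tries:       (state, action) -> number of times the action was executed there
    estimates:   (state, action, next_state) -> empirical transition frequency
    """
    tries = defaultdict(int)
    landings = defaultdict(int)
    state = start
    for _ in range(steps):
        # Choose the action executed the fewest times from the current state.
        action = min(actions, key=lambda a: tries[(state, a)])
        next_states, probs = zip(*transitions[(state, action)].items())
        nxt = rng.choices(next_states, weights=probs)[0]
        tries[(state, action)] += 1
        landings[(state, action, nxt)] += 1
        state = nxt
    estimates = {
        (s, a, j): count / tries[(s, a)] for (s, a, j), count in landings.items()
    }
    return tries, estimates
```

By construction, the numbers of times any two actions have been executed from the same state never differ by more than one.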

The other phase, known as the exploitation phase, consists of making use of the statistics gathered so far to compute a policy that, hopefully, at some point will be close to the optimal one in the sense made precise in Section 2.

These two phases are interleaved during the learning process. That is, the algorithm collects statistics by experimenting on the MDP (therefore, it is in the exploration phase) and at some point it decides to switch to the exploitation phase and attempts to compute an optimal policy using the approximate model constructed so far. Whenever the policy is still not close enough to the optimal one, it goes back to the exploration phase and tries to collect new statistics about the unknown part of the underlying MDP, and so on. Thus, it is important to notice that the algorithm might choose in some cases not to build a complete model of the underlying MDP if doing so is not necessary to achieve a close to optimal return.

One of the main contributions of the $E^3$ algorithm was to provide an explicit method for deciding when to switch between phases, hence the name Explicit Explore or Exploit ($E^3$) for the algorithm. For this, a crucial definition was that of a known state, a state that has been visited enough times so that the estimates of the transition probabilities from that state are close enough to their true values. We state here their definition for future comparison with the new one we will provide in Section 4. Recall that $\varepsilon$, $\delta$ and $T$ are the input parameters of the algorithm as described in Section 2.

Definition 1. Let $M$ be a Markov decision process over $N$ states. We say that a state $i$ of $M$ is known if each action has been executed from $i$ at least $m_{kn} = O(((N T R_{\max})/\varepsilon)^{4} \log(1/\delta))$ times.

An important observation is that, given the definition above, we cannot do balanced wandering forever before at least one state becomes known. By the Pigeonhole Principle, after at most $N(m_{kn} - 1) + 1$ steps of balanced wandering, some state becomes known.

When the agent lands in a known state (either previously known or one that just becomes known at this point for the first time), the algorithm does the following. It builds a new MDP with the set of currently known states. More precisely, if $S$ is the set of currently known states in $M$, it constructs an MDP $M_S$ induced on $S$ by $M$, where all transitions between states in $S$ are preserved and the rest are redirected to a new absorbing state that intuitively represents all the unknown and unvisited states.
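The construction of the induced process can be sketched in a few lines. The representation (a dictionary from state-action pairs to sparse rows) and the label chosen for the absorbing state are assumptions of this sketch; rewards are left out because the passage above only describes the transition structure.

```python
def induced_mdp(transitions, known, absorbing="s0"):
    """Transitions of the MDP induced on the set `known`.

    transitions: (state, action) -> {next_state: probability}
    Transitions between known states are preserved; probability mass leaving
    the known set is redirected to a single absorbing state.
    """
    actions = {a for (_, a) in transitions}
    induced = {}
    for (i, a), row in transitions.items():
        if i not in known:
            continue
        new_row = {}
        for j, p in row.items():
            target = j if j in known else absorbing
            new_row[target] = new_row.get(target, 0.0) + p
        induced[(i, a)] = new_row
    for a in actions:
        # Every action keeps the agent in the absorbing state.
        induced[(absorbing, a)] = {absorbing: 1.0}
    return induced
```

In the algorithm itself the same construction is applied to the estimated transition probabilities, since the true ones are not available.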

Notice that even though the algorithm does not know $M_S$, it has a good approximation of it thanks to the definition of known state; we will refer to this approximation as $\hat{M}_S$. The notion of approximation used is the following.

Definition 2. Let $M$ and $\hat{M}$ be two Markov decision processes over the same set of states. Then, we will say that $\hat{M}$ is an $\alpha$-approximation of $M$ if for any pair of states $i$ and $j$ and any action $a$, the following inequalities are satisfied:

$$P^a_M(ij) - \alpha \leq P^a_{\hat{M}}(ij) \leq P^a_M(ij) + \alpha$$

It can be shown by a straightforward application of the Hoeffding bound that if a state is known as defined in Definition 1 then $\hat{M}_S$ is an $O((\varepsilon/(N T R_{\max}))^{2})$-approximation of $M_S$. Moreover, Kearns and Singh proved a Simulation Lemma stating that, in this case, the expected $T$-step return of any policy in $\hat{M}_S$ is close to the expected return of the same policy in $M_S$.

The other key lemma in their analysis is the Exploit or Explore Lemma, which states that either the optimal policy achieves high return just by using the set of known states $S$ (which can be detected thanks to $\hat{M}_S$ and the Simulation Lemma) or the optimal policy has high probability of leaving $S$ (which again the algorithm can detect by finding an exploration policy that quickly reaches the additional absorbing state in $\hat{M}_S$). Thus, performing two off-line computations on $\hat{M}_S$, the



246 Carlos Domingo 



algorithm is provided with either a way to compute a policy with near-optimal return for the next $T$ steps or a way to obtain new statistics on an unknown or unvisited state. The computation time required for these off-line computations is shown to be bounded by $O(N^{2} T/\varepsilon)$.

Putting all these pieces together, Kearns and Singh showed that the $E^3$ algorithm achieves, with probability larger than $1 - \delta$, a return that is $\varepsilon$-close to the return of the optimal policy in time polynomial in $N$, $T$, $1/\varepsilon$, and $\log(1/\delta)$. The main factor in the overall bound comes from the definition of known states, that is, the number of times that a state needs to be visited during the exploration phase before we can attempt to compute an optimal policy. We refer to their paper for further details.

Our improvement affects only the part concerning the exploration phase. In the following section we will show how a more efficient sampling method allows us to declare a state as known more quickly, obtaining a significant reduction in the overall running time while keeping the same theoretical guarantees of reliability. Our improvement does not affect the rest of the algorithm and thus the same lemmas can be used, in almost exactly the same way as in the original algorithm, to prove the correctness of the overall algorithm.

4 Knowing the States Faster: Adding Adaptivity to the 
Exploration Phase 



As we mentioned in the introduction, the key idea for improving $E^3$ relies on using a different sampling method for estimating the transition probabilities. This will result in a substantial reduction in the number of samples needed before a state is declared known. For this, we first need to modify the notion of approximation given in Definition 2.

Definition 3. Let $M$ and $\hat{M}$ be two Markov decision processes over the same set of states. Then, we say that $\hat{M}$ is an $\alpha$-strong approximation of $M$ if for any pair of states $i$ and $j$, and any action $a$ such that $P^a_M(ij) \geq \alpha$, the following inequalities are satisfied:

$$(1 - \alpha) P^a_M(ij) \leq P^a_{\hat{M}}(ij) \leq (1 + \alpha) P^a_M(ij)$$

The differences between this definition and the original one are the following. We require that only the transitions that are not "too small" be approximated, and in a multiplicative way, while previously every transition was required to be approximated, but just in an additive way. The following key lemma, related to the one already proved in ^ , holds under the definition of approximation given above.
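The contrast between the two notions can be made concrete with a pair of checks. In this sketch, which is ours and not the paper's, both processes are represented by dictionaries mapping a triple (state, action, next state) to a probability, and a missing entry in the approximating process is treated as probability zero.

```python
def is_additive_approximation(P, P_hat, alpha):
    """Every transition probability is matched within an additive error alpha."""
    return all(abs(P_hat.get(key, 0.0) - p) <= alpha for key, p in P.items())


def is_strong_approximation(P, P_hat, alpha):
    """Every transition with probability >= alpha is matched within a factor (1 +/- alpha)."""
    return all(
        (1.0 - alpha) * p <= P_hat.get(key, 0.0) <= (1.0 + alpha) * p
        for key, p in P.items()
        if p >= alpha
    )
```

An estimate can satisfy the additive condition while failing the multiplicative one on a small transition, and transitions with probability below $\alpha$ do not constrain the strong notion at all.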

Lemma 1. (Modified Simulation Lemma) Let $M$ be any Markov decision process over $N$ states and let $\hat{M}$ be an $O(\varepsilon/(N T R_{\max}))$-strong approximation of $M$. Then for any policy $\pi$, number of steps $T$ and any state $i$, $U^{\pi}_M(i, T) - \varepsilon \leq U^{\pi}_{\hat{M}}(i, T) \leq U^{\pi}_M(i, T) + \varepsilon$.






Proof. The proof follows the same lines as the proof of the related lemma provided in p]. For simplicity of notation, let us write $\alpha = c\varepsilon/(4 N T R_{\max})$, where $c$ is some constant smaller than 1, and let us fix throughout the proof a policy $\pi$ and a start state $i$.

We first consider the contribution to the return of the transitions that are not approximated in $\hat{M}$, that is, the transitions whose transition probabilities are smaller than $\alpha$. Since we have $N$ states, the total probability of all these transitions is at most $\alpha N$. Moreover, the probability that in $T$ steps we cross any of these transitions can be bounded by $\alpha N T$. Thus, the total contribution to the expected return of any of these transitions is at most $\alpha N T R_{\max}$. By our definition of $\alpha$ this quantity is smaller than $c\varepsilon$.

Let us consider the paths that do not cross any transition whose transition probability is smaller than $\alpha$. By our definition of $\alpha$-strong approximation, for any path $p$ of length $T$ the following holds:

$$(1 - \alpha)^{T} \Pr\nolimits^{\pi}_M[p] \leq \Pr\nolimits^{\pi}_{\hat{M}}[p] \leq (1 + \alpha)^{T} \Pr\nolimits^{\pi}_M[p]$$

Recall that $U^{\pi}_M(i, T) = \sum_{p} \Pr^{\pi}_M[p]\, U_M(p)$. Since the inequality above holds for any path $p$, it also holds when we take the expected value. In other words, the following inequality can be derived from the inequality above:

$$(1 - \alpha)^{T} U^{\pi}_M(i, T) - c\varepsilon \leq U^{\pi}_{\hat{M}}(i, T) \leq (1 + \alpha)^{T} U^{\pi}_M(i, T) + c\varepsilon$$

where the term $c\varepsilon$ comes from our previous calculation of the error introduced by the small transitions not required to be approximated in the definition of $\alpha$-strong approximation. Now we should show that our choice of $\alpha$ together with the inequality obtained above implies the lemma. We will show it only for the upper bound; the lower bound can be shown in a similar manner. For the upper bound, we need to show that the following two inequalities hold:

$$(1 + \alpha)^{T} U^{\pi}_M(i, T) \leq U^{\pi}_M(i, T) + \varepsilon/2 \quad \text{and} \quad c\varepsilon \leq \varepsilon/2$$

The second inequality easily follows by choosing an appropriate value for $c$ in the definition of $\alpha$. For the first one, showing that $(1 + \alpha)^{T} \leq 1 + \varepsilon/(2R_{\max})$ holds will suffice, since the average reward is bounded by $R_{\max}$. To see this, notice that from the Taylor expansion of $\log(1 + \alpha)$ one can show that $T \log(1 + \alpha)$ is at most $T\alpha$, and standard calculus shows that $e^{x} \leq 1 + 2x$ for $0 \leq x \leq 1$. Thus, choosing $\alpha$ such that $2T\alpha \leq \varepsilon/(2R_{\max})$ will suffice, and this is satisfied by our choice of $\alpha$ and an appropriate choice of $c$. □
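The two quantities controlled in the proof can be checked numerically for concrete parameter values. The sketch below uses $\alpha = c\varepsilon/(4 N T R_{\max})$ with $c = 1/2$, which is our choice of constant; the proof only requires $c$ to be small enough. The function names are hypothetical.

```python
def strong_alpha(eps, n_states, horizon, r_max, c=0.5):
    """alpha = c * eps / (4 * N * T * R_max), the accuracy used in the proof."""
    return c * eps / (4.0 * n_states * horizon * r_max)


def proof_inequalities_hold(eps, n_states, horizon, r_max, c=0.5):
    """Check the small-transition term and the growth factor (1 + alpha)^T."""
    alpha = strong_alpha(eps, n_states, horizon, r_max, c)
    # Contribution of transitions with probability below alpha: alpha * N * T * R_max.
    small_term_ok = alpha * n_states * horizon * r_max <= c * eps
    # Multiplicative drift accumulated over T steps.
    growth_ok = (1.0 + alpha) ** horizon <= 1.0 + eps / (2.0 * r_max)
    return small_term_ok and growth_ok and c * eps <= eps / 2.0
```

With these choices the accumulated factor $(1 + \alpha)^T$ stays well below $1 + \varepsilon/(2R_{\max})$ because $T\alpha = c\varepsilon/(4 N R_{\max})$ does not grow with the horizon.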

As in 0, the appropriate definition of known state should be derived from 
the Simulation Lemma. We will use the following one. 

Definition 4. Let $\alpha = O(\varepsilon/(N T R_{\max}))$. A state $i$ will be called well-known when for any state $j$ and any action $a$ such that $P^a_M(ij) \geq \alpha$,

$$(1 - \alpha) P^a_M(ij) \leq P^a_{\hat{M}}(ij) \leq (1 + \alpha) P^a_M(ij)$$

Notice that a straightforward application of Chernoff bounds is not appropriate to determine how many times we need to visit a state before it can be declared well-known. This is because the required approximation is multiplicative, so the number of visits obtained from the Chernoff bound depends on the true transition probabilities, which are unknown to the algorithm. In fact, they are precisely what is being estimated. On the other hand, the Hoeffding bound cannot be used here to derive the number of necessary steps, since it only provides an additive approximation while we want a multiplicative one. For the statements of the Chernoff and Hoeffding bounds we refer the reader to .

To get around this difficulty, we will use a sequential sampling algorithm based on the one proposed by Lipton et al. m for database query estimation. This algorithm will replace the following steps of the $E^3$ algorithm. Recall that when the $E^3$ algorithm is in the exploration phase, it does balanced wandering. That is, it just executes the least used action from the state where the agent is currently located, updates the estimates according to the landing state and, in case the number of times that the state has been visited becomes larger than the number required by the definition of known state given in Section 3, declares it as known.

 1  AdaExploStep /* for state i */
 2  if state i is visited for the first time then
 3      α = O(ε/(N T R_max)); β = 10 ln(2kN/δ)/α²;
 4      for all states j ∈ S and actions a ∈ A do
 5          m_a = 0; N^a_M(ij) = 0;
 6  apply action a (least used, breaking ties randomly);
 7  let j be the landing state; N^a_M(ij) = N^a_M(ij) + 1;
 8  m_a = m_a + 1;
 9  if N^a_M(ij) > 3 ln(2kN/δ)(1 + α)/α² then
10      declare P^a_M(ij) as estimated by P̂^a_M(ij) = N^a_M(ij)/m_a;
11  if m_a > β then
12      declare all remaining P^a_M(ij') as estimated by P̂^a_M(ij') = N^a_M(ij')/m_a;
13  if all P^a_M(ij) are estimated for all states j and actions a then
14      declare i as well-known.



Fig. 1. Adaptive exploration step for state i.



Our approach is the following. Every time we land on an unknown or unvisited
state $i$ in the exploration phase we will execute an Adaptive Exploration
Step (AdaExploStep for short). Pseudocode for AdaExploStep is provided in
Figure 1 and we discuss it now. First, if the state is unvisited, we initialize
the variables $N^a_M(ij)$ that will be used for estimating the real transition
probabilities $P^a_M(ij)$, the number of times $n_{ia}$ that every action $a$ has
been used from it, and the two parameters of the algorithm, $\alpha$ and $\beta$.
Then, we choose the action to execute (the least used, breaking ties randomly),
denote by $j$ the landing state, and update $N^a_M(ij)$ and $n_{ia}$ accordingly.
Then, we check two conditions whose meaning is the following. The first
condition (line 9 of Figure 1) controls whether the estimator
$\hat{P}^a_M(ij) = N^a_M(ij)/n_{ia}$ of $P^a_M(ij)$ is already $\alpha$-close to it in a
multiplicative sense, as desired. Notice that for doing this we are just using
the value $N^a_M(ij)$. The second condition (line 11 of Figure 1) checks whether we
have done enough sampling to guarantee with high probability that all the
transition probabilities that have not been declared estimated yet must be
smaller than $\alpha$. Finally, when all the transition probabilities for all the
actions are declared as estimated, the state is declared well-known. In the
following theorem we discuss the reliability and complexity of the procedure
just described.
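The exploration step described above can be simulated for a single state. The following is a minimal sketch, not the authors' implementation: the function name `ada_explo_state` and its arguments are hypothetical, $\alpha$ is passed in directly because the constant hidden in $O(\epsilon/(NTR_{\max}))$ is not specified, and the exponent of $\alpha$ in $\beta$ is taken to be 3 as an assumption of this sketch (so that $c/\beta < \alpha$, which the second stopping condition needs). As a further simplification, actions whose transitions are all estimated are no longer selected.

```python
import math
import random


def ada_explo_state(true_probs, alpha, delta, rng):
    """Simulate AdaExploStep on one state until it is declared well-known.

    true_probs[a][j] is the (hidden) probability of landing in state j
    after action a. Returns (estimates, number_of_steps).
    """
    k, n_states = len(true_probs), len(true_probs[0])
    log_term = math.log(2 * k * n_states / delta)
    c = 3 * log_term * (1 + alpha) / alpha ** 2   # threshold of the first condition
    beta = 10 * log_term / alpha ** 3             # second condition (exponent assumed)
    uses = [0] * k                                # n_ia
    counts = [[0] * n_states for _ in range(k)]   # N^a_M(ij)
    est = [[None] * n_states for _ in range(k)]   # declared estimates
    steps = 0
    while True:
        active = [a for a in range(k) if None in est[a]]
        if not active:                            # everything estimated: well-known
            return est, steps
        fewest = min(uses[a] for a in active)
        a = rng.choice([b for b in active if uses[b] == fewest])  # least used, random ties
        j = rng.choices(range(n_states), weights=true_probs[a])[0]
        counts[a][j] += 1
        uses[a] += 1
        steps += 1
        if est[a][j] is None and counts[a][j] >= c:       # first condition
            est[a][j] = counts[a][j] / uses[a]
        if uses[a] >= beta:                               # second condition
            for j2 in range(n_states):
                if est[a][j2] is None:
                    est[a][j2] = counts[a][j2] / uses[a]


if __name__ == "__main__":
    probs = [[0.7, 0.3, 0.0], [0.5, 0.5, 0.0]]
    estimates, steps = ada_explo_state(probs, alpha=0.2, delta=0.1, rng=random.Random(0))
    print(steps, estimates)
```

Transitions with large probability are settled early by the first condition, while the zero-probability transitions are only settled once the action has been used $\beta$ times, which is the adaptive behaviour the paragraph describes.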

Theorem 1. Let $i$ be a state in a Markov decision process $M$ and let
$\alpha = O(\epsilon/(N T R_{\max}))$. Then, if procedure AdaExploStep declares state $i$ as
well-known, with probability more than $1-\delta$ it does so correctly in at most
$m = O(k\ln(2kN/\delta)/(\alpha^2\alpha'))$ steps, where $\alpha'$ is the maximum of the
smallest transition probability and $\alpha$.

Proof. Let us start by proving the correctness, that is, that the estimates
$\hat{P}^a_M(ij)$ output by procedure AdaExploStep satisfy Definition 4 with high
probability. For this, let us first fix one particular transition probability
$p = P^a_M(ij)$ and let us denote by $\alpha = O(\epsilon/(N T R_{\max}))$ the desired
accuracy of the estimate. Furthermore, let us suppose that the algorithm
declares $p$ as estimated in the first if-then condition, that is,
$N(p) = N^a_M(ij)$ becomes larger than $c = 3\ln(2kN/\delta)(1+\alpha)/\alpha^2$.
Notice that since $N(p)$ increases by at most 1 at every step, at the stopping
step $N(p)$ satisfies $c \le N(p) < c+1$. Thus, since the estimator output by the
algorithm is $\hat{p} = N(p)/n_{ia}$, it can be easily verified that for any value of
$n_{ia}$ satisfying

$$\frac{c}{p(1+\alpha)} \;\le\; n_{ia} \;\le\; \frac{c+1}{p(1-\alpha)}$$

the estimate $\hat{p}$ is within the desired range, that is,
$(1-\alpha)p \le \hat{p} \le (1+\alpha)p$. Therefore, the probability of error can be
bounded by the probability of stopping before $\ell_1 = c/(p(1+\alpha))$ steps (and
thus, $\hat{p} > p(1+\alpha)$) plus the probability of stopping after
$\ell_2 = (c+1)/(p(1-\alpha))$ steps (and thus, $\hat{p} < p(1-\alpha)$). Moreover, notice
that the stopping condition is monotone in the following sense. If it is
satisfied at step $\ell$ then it will also be satisfied at any step $\ell' > \ell$, and
if it has not been satisfied at step $\ell$, then it has also not been satisfied at
any step $\ell' < \ell$, no matter what the results of the random trials are.
Therefore, we need to consider only the two extreme points $\ell_1$ and $\ell_2$.
Applying the Chernoff bound to bound those two probabilities, and by our choice
of $c$, we can conclude that the probability that $\hat{p}$ does not correctly
estimate $p$ is less than $\delta/(2kN)$. Since there are at most $kN$ transition
probabilities being estimated simultaneously, by the union bound, the
probability that any of them fails is at most $\delta/2$.

We have just shown that, with high probability, any transition probability
$p \ge \alpha$ that is declared as estimated by the first condition satisfies
Definition 4. Moreover, it takes at most $(c+1)/(p(1-\alpha))$ steps. By our choice
of $c$, and noticing that $(1+\alpha)/(1-\alpha)$ is smaller than $5/3$ for any $\alpha$
less than 0.25, it follows that the algorithm will estimate $p$ in at most
$5\ln(2kN/\delta)/(\alpha^2 p)$ steps.
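To make the quantities in this step bound concrete, the sketch below evaluates the threshold $c$, the latest stopping step $(c+1)/(p(1-\alpha))$ and the stated bound $5\ln(2kN/\delta)/(\alpha^2 p)$ for one set of parameters. The function name and the numeric values are illustrative assumptions, not taken from the paper.

```python
import math


def stopping_quantities(alpha, delta, k, n_states, p):
    """Return (c, latest stopping step, stated bound) for one transition probability p."""
    log_term = math.log(2 * k * n_states / delta)
    c = 3 * log_term * (1 + alpha) / alpha ** 2
    latest_stop = (c + 1) / (p * (1 - alpha))
    stated_bound = 5 * log_term / (alpha ** 2 * p)
    return c, latest_stop, stated_bound


if __name__ == "__main__":
    # Hypothetical parameters: 2 actions, 10 states, alpha = delta = 0.1, p = 0.2.
    c, latest_stop, stated_bound = stopping_quantities(0.1, 0.1, 2, 10, 0.2)
    print(round(c), round(latest_stop), round(stated_bound))
```

For these values the latest stopping step lies below the stated bound; note that the bound scales with $1/p$, which is exactly why it cannot be fixed in advance without knowing $p$.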

Now, let us discuss the second if-then condition. Its meaning is the
following. If we are doing too many steps in the same state and there are still
transition probabilities that have not yet been declared as estimated, then
those transitions are “small” with high probability and we can declare the
state as well-known, satisfying Definition 4, without further steps. To show
the correctness of this condition, suppose that the algorithm stops in the
second condition and let $q = P^a_M(ij)$ be a transition probability that was not
yet declared as estimated. Thus, from the first if-then condition (not yet
satisfied) we know that $N(q) = N^a_M(ij)$ is smaller than $c$, and from the second
if-then condition (just satisfied) we know that $n_{ia} \ge d$, where
$d = 10\ln(2kN/\delta)/\alpha^3$. Thus, the probability that $q$ is larger than
$\alpha$ can be bounded by the probability that $q - N(q)/d$ is larger than
$\alpha - c/d$. Applying the Hoeffding bound, and by our choice of $d$ and $c$, it
can be derived that this probability is smaller than $\delta/(2kN)$. Again, using
the union bound, we can bound by $\delta/2$ the probability that a transition
probability larger than $\alpha$ is incorrectly declared as estimated by the second
condition.

Finally, the probability that the algorithm makes a mistake in either condition
is bounded by $\delta/2 + \delta/2 = \delta$, and the theorem follows. □

We have just seen that AdaExploStep correctly declares the states as
well-known. Moreover, if $S$ is the set of well-known states, then $\hat{M}_S$ will
be a strong approximation of $M_S$ in the sense of Definition [?] and thus, by
Lemma [?], it will appropriately simulate $M_S$. Thus, it can be used to compute
the off-line policies in the same way as in the original algorithm.

5 Conclusion 

We have seen how we can modify the exploration phase of the $E^3$ algorithm
by an adaptive sampling method so that it is possible to improve its overall
time bound. Moreover, we have argued that, due to the adaptiveness of our
method, it should be more suitable for practical purposes, and part of our
future work will be to implement our version of $E^3$ and test it experimentally.

Notice that algorithm $E^3$, as well as our improvement, suffers from being
polynomial in the number of states $N$, which might be impractical in certain
problems. One possible way around this problem is to consider factored MDPs,
that is, MDPs whose transition model can be factored as a dynamic Bayesian
network. A generalization of the $E^3$ algorithm to that case has recently been
obtained in [?]. Our method seems to be applicable also to improve that
generalization, although some technical points need to be carefully checked,
and this will be part of our future work.
Abstract. When training Support Vector Machines (SVMs) over non-separable
data sets, one sets the threshold b using any dual cost coefficient
that is strictly between the bounds of 0 and C. We show that there exist 
SVM training problems with dual optimal solutions with all coefficients 
at bounds, but that all such problems are degenerate in the sense that 
the “optimal separating hyperplane” is given by w = 0, and the resulting 
(degenerate) SVM will classify all future points identically (to the class 
that supplies more training data). We also derive necessary and sufficient 
conditions on the input data for this to occur. Finally, we show that an 
SVM training problem can always be made degenerate by the addition of 
a single data point belonging to a certain unbounded polyhedron, which 
we characterize in terms of its extreme points and rays. 

1 Introduction 

We are given $\ell$ examples $(x_1, y_1), \ldots, (x_\ell, y_\ell)$, with $x_i \in \mathbb{R}^n$ and
$y_i \in \{-1, 1\}$ for all $i$. The SVM training problem is to find a hyperplane
and threshold $(w, b)$ that separates the positive and negative examples with
maximum margin, penalizing misclassifications linearly in a user-selected
penalty parameter $C > 0$.^1 This formulation was introduced in [?]. For a good
introduction to SVMs and the nonlinear programming problems involved in their
training, see [?] or [?]. We train an SVM by solving either of the following
pair of dual quadratic programs:



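$$(P)\qquad \min_{w,\,b,\,\Xi}\ \tfrac{1}{2}\|w\|^2 + C\Big(\sum_{i=1}^{\ell}\xi_i\Big)
\qquad\text{subject to}\qquad y_i(w\cdot x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0$$

$$(D)\qquad \max_{\Lambda}\ \Lambda\cdot\mathbf{1} - \tfrac{1}{2}\Lambda D\Lambda
\qquad\text{subject to}\qquad \Lambda\cdot y = 0,\quad \lambda_i \le C,\quad \lambda_i \ge 0$$

where we used the vector notations $\Xi = (\xi_1, \ldots, \xi_\ell)$ and
$\Lambda = (\lambda_1, \ldots, \lambda_\ell)$. $D$ is the symmetric positive semidefinite
matrix defined by $D_{ij} = y_i y_j\, x_i\cdot x_j$. Throughout this note, we use the
convention that if an equation contains $i$ as an unsummed subscript, the
corresponding equation is replicated for all $i \in \{1, \ldots, \ell\}$.

* INFM-DISI, Università di Genova, Via Dodecaneso 35, 16146 Genova, Italy
^1 Actually, we penalize linearly points for which $y_i(w\cdot x_i + b) < 1$; such
points are not actually “misclassifications” unless $y_i(w\cdot x_i + b) < 0$.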

O. Watanabe, T. Yokomori (Eds.): ALT’99, LNAI 1720, pp. 252-12221 1999- 
(c) Springer- Verlag Berlin Heidelberg 1999 
In practice, the dual program is solved.^2 However, for this pair of
primal-dual problems, the KKT conditions are necessary and sufficient to
characterize optimal solutions. Therefore, $w$, $b$, $\Xi$, and $\Lambda$ represent a
pair of primal and dual optimal solutions if and only if they satisfy the KKT
conditions. Additionally, any primal and dual feasible solutions with identical
objective values are primal and dual optimal. The KKT conditions (for the
primal problem) are as follows:



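$$w - \sum_{i=1}^{\ell}\lambda_i y_i x_i = 0 \qquad (1)$$

$$\sum_{i=1}^{\ell}\lambda_i y_i = 0 \qquad (2)$$

$$C - \lambda_i - \mu_i = 0 \qquad (3)$$

$$y_i(x_i\cdot w + b) - 1 + \xi_i \ge 0 \qquad (4)$$

$$\lambda_i\{y_i(x_i\cdot w + b) - 1 + \xi_i\} = 0 \qquad (5)$$

$$\mu_i\xi_i = 0 \qquad (6)$$

$$\xi_i,\ \lambda_i,\ \mu_i \ge 0 \qquad (7)$$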

The $\mu_i$ are Lagrange multipliers associated with the $\xi_i$; they do not appear
explicitly in either (P) or (D). The KKT conditions will be our major tool for
investigating the properties of solutions to (P) and (D).

Suppose that we have solved (D) and possess a dual optimal solution $\Lambda$.
Equation (1) allows us to determine $w$ for the associated primal optimal
solution. Further suppose that there exists an $i$ such that $0 < \lambda_i < C$.
Then, by equation (3), $\mu_i > 0$, and by equation (6), $\xi_i = 0$. Because
$\lambda_i \ne 0$, equation (5) tells us that $y_i(x_i\cdot w + b) - 1 + \xi_i = 0$.
Using $\xi_i = 0$ and $y_i^2 = 1$, we see that we can determine the threshold $b$
using the equation $b = y_i - x_i\cdot w$.

Once $b$ is known, we can determine the $\xi_i$ by noting that $\xi_i = 0$ if
$\lambda_i \ne C$ (by equations (3) and (6)), and that
$\xi_i = 1 - y_i(x_i\cdot w + b)$ otherwise (by equation (5)). However, this is not
strictly necessary, as it is $w$ and $b$ that must be known in order to classify
future instances.
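The recovery of the primal solution from a dual solution can be sketched as follows. This is an illustration under stated assumptions rather than code from the paper: `recover_primal` is a hypothetical helper, and the threshold is computed as $b = y_i - x_i\cdot w$, which follows from equation (5) with $\xi_i = 0$ after multiplying by $y_i$ and using $y_i^2 = 1$. When no coefficient lies strictly between 0 and $C$, the helper returns `None` for $b$ and $\Xi$, which is the situation the rest of the paper studies.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def recover_primal(X, y, lam, C, tol=1e-9):
    """Recover (w, b, xi) from a dual optimal solution lam.

    w comes from KKT equation (1); b needs some 0 < lam_i < C.
    Returns (w, None, None) when every lam_i is at a bound.
    """
    size, dim = len(X), len(X[0])
    w = [sum(lam[i] * y[i] * X[i][d] for i in range(size)) for d in range(dim)]
    free = [i for i in range(size) if tol < lam[i] < C - tol]
    if not free:
        return w, None, None
    i = free[0]
    b = y[i] - dot(X[i], w)
    # xi_j = 0 unless lam_j = C, in which case equation (5) fixes it.
    xi = [0.0 if lam[j] < C - tol else 1 - y[j] * (dot(X[j], w) + b)
          for j in range(size)]
    return w, b, xi


if __name__ == "__main__":
    # Two separable points; lam = (0.5, 0.5) is dual optimal for any C > 0.5.
    print(recover_primal([(0.0, 0.0), (2.0, 0.0)], [-1, 1], [0.5, 0.5], C=10.0))
```

In the two-point usage example both coefficients are strictly inside $(0, C)$, so either index yields the same threshold.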

We note that our ability to determine $b$ and $\Xi$ is crucially dependent on the
existence of a $\lambda_i$ strictly between 0 and $C$. Additionally, the optimality
conditions, and therefore the SVM training algorithm derived in Osuna’s thesis
[3], depend on the existence of such a $\lambda_i$ as well. On page 49 of his thesis
Osuna states that “We have not found a proof yet of the existence of such
$\lambda_i$, or conditions under which it does not exist.” Other discussions of SVMs
([?], [?]) also implicitly assume the existence of such a $\lambda_i$.

In this paper, we show that there need not exist a $\lambda_i$ strictly between
bounds. Such cases are a subset of degenerate SVM training problems: those
problems where the optimal separating “hyperplane” is $w = 0$, and the optimal
solution is to assign all future points to the same class. We derive a strong
characterization of SVM degeneracy in terms of conditions on the input data. We
go on to show that any SVM training problem can be made degenerate via the
addition of a single training point, and that, assuming the two classes are of
different cardinalities, this new training point can fall anywhere in a certain
unbounded polyhedron. We provide a strong characterization of this polyhedron,
and give a mild condition which will ensure non-degeneracy.

^2 SVMs in general use a nonlinear kernel mapping. In this note, we explore the
linear simplification in order to gain insight into SVM behavior. Our analysis
holds identically in the nonlinear case.

2 Support Vector Machine Degeneracy 

In this section, we explore SVM training problems with a dual optimal solution
satisfying $\lambda_i \in \{0, C\}$ for all $i$.

We begin by noting and dismissing the trivial example where all training
points belong to the same class, say class 1. In this case, it is easily seen
that $\Lambda = 0$, $\Xi = 0$, $w = 0$, and $b = 1$ represent primal and dual optimal
solutions, both with objective value 0.

Definition 1. A vector $\Lambda$ is a {0, C}-solution for an SVM training problem
$\mathcal{P}$ if $\Lambda$ solves (D), $\lambda_i \in \{0, C\}$ for all $i$, and $\Lambda \ne 0$
(note that this includes cases where $\lambda_i = C$ for all $i$).

We demonstrate the existence of problems having {0, C}-solutions with an
example where the data lie in $\mathbb{R}^2$:

    x        y
    (2,3)    1
    (2,2)   -1
    (1,2)    1
    (1,3)   -1

$$D = \begin{pmatrix} 13 & -10 & 8 & -11 \\ -10 & 8 & -6 & 8 \\ 8 & -6 & 5 & -7 \\ -11 & 8 & -7 & 10 \end{pmatrix}$$

Suppose $C = 10$. The reader may easily verify that $\Lambda = (10, 10, 10, 10)$,
$w = 0$, $b = 1$, $\Xi = (0, 2, 0, 2)$ are feasible primal and dual solutions, both
with objective value 40, and are therefore optimal. Actually, given our choice
of $\Lambda$ and $w$, we may set $b$ anywhere in the closed interval $[-1, 1]$, and set
$\Xi = (1 - b, 1 + b, 1 - b, 1 + b)$.

We have demonstrated the possibility of {0, C}-solutions, but the above
example seems highly abnormal. The data are distributed at the four corners of
a unit square centered at (1.5, 2.5), with opposite corners being of the same
class. The “optimal separating hyperplane” is $w = 0$, which is not a hyperplane
at all. We now proceed to formally show that all SVM training problems which
admit {0, C}-solutions are degenerate in this sense.

The following lemma is obvious from inspection of the KKT conditions:

Lemma 1. Suppose that $\Lambda$ is a {0, C}-solution to an SVM training problem
$\mathcal{P}_1$ with $C = C_1$. Given a new SVM training problem $\mathcal{P}_2$ with
identical input data and $C = C_2$, $(C_2/C_1)\cdot\Lambda$ is dual optimal for
$\mathcal{P}_2$. The corresponding primal optimal solution(s) is (are) unchanged.
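The four-point example above can be checked numerically. The sketch below is only an illustration: the helper `objectives` is hypothetical, and it takes each $\xi_i$ to be the smallest slack allowed by the primal constraints $y_i(w\cdot x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$. It evaluates the primal and dual objectives for $\Lambda = (10, 10, 10, 10)$ at several thresholds $b$ in $[-1, 1]$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objectives(X, y, lam, b, C):
    """Return (w, xi, primal value, dual value) for a candidate solution."""
    size = len(X)
    # w from KKT equation (1).
    w = [sum(lam[i] * y[i] * X[i][d] for i in range(size)) for d in range(len(X[0]))]
    # Smallest slack satisfying the primal constraints for this (w, b).
    xi = [max(0.0, 1 - y[i] * (dot(w, X[i]) + b)) for i in range(size)]
    primal = 0.5 * dot(w, w) + C * sum(xi)
    quad = sum(lam[i] * y[i] * y[j] * dot(X[i], X[j]) * lam[j]
               for i in range(size) for j in range(size))
    dual = sum(lam) - 0.5 * quad
    return w, xi, primal, dual


if __name__ == "__main__":
    X = [(2, 3), (2, 2), (1, 2), (1, 3)]
    y = [1, -1, 1, -1]
    for b in (-1.0, 0.0, 1.0):
        print(b, objectives(X, y, [10, 10, 10, 10], b, C=10))
```

For every threshold tried, $w = 0$ and both objectives agree, so each such $(w, b, \Xi)$ is optimal together with $\Lambda$; only the distribution of the slack between the two classes changes with $b$.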
We see that {0, C}-solutions are not dependent on a particular choice of $C$.
This in turn implies the following:

Lemma 2. If $\Lambda$ is a {0, C}-solution to an SVM training problem $\mathcal{P}$, then
$D\cdot\Lambda = 0$.

Proof: Since $D$ is symmetric positive semidefinite, we can write
$D = R\Sigma R^T$, where $\Sigma$ is a diagonal matrix with the (nonnegative)
eigenvalues $\sigma_i$ of $D$ in descending order on the diagonal, $R$ is an
orthogonal basis of corresponding eigenvectors of $D$, and $RR^T = I$. If
$D\cdot\Lambda \ne 0$, then for some index $k$, $\sigma_k > 0$ and $R_k\cdot\Lambda \ne 0$.

For any value of $C$, let $\Lambda_C$ be the {0, C}-solution obtained by adjusting
$\Lambda$ appropriately. This solution is dual optimal for a problem having input
data identical to $\mathcal{P}$, with a new value of $C$, by Lemma 1. Then

$$\Lambda_C D\Lambda_C = \sum_{i=1}^{\ell}\sigma_i\|R_i\cdot\Lambda_C\|^2
\;\ge\; \sigma_k\|R_k\cdot\Lambda_C\|^2 = \sigma_k C^2\|R_k\cdot\Lambda_1\|^2.$$

Define $S$ to be the number of non-zero elements in $\Lambda$. As we vary $C$, the
optimal dual objective value of our family of {0, C}-solutions is given by:

$$f_\Lambda(C) = \Lambda_C\cdot\mathbf{1} - \tfrac{1}{2}\Lambda_C D\Lambda_C
\;\le\; SC - \tfrac{1}{2}\sigma_k C^2\|R_k\cdot\Lambda_1\|^2.$$

However, if

$$C^* > \frac{2S}{\sigma_k\|R_k\cdot\Lambda_1\|^2},$$

then $f_\Lambda(C^*) < 0$. This is a contradiction, for $\Lambda = 0$ is feasible in
$\mathcal{P}$ with objective value zero, and zero is therefore a lower bound on the
value of any optimal solution to $\mathcal{P}$, regardless of the value of $C$.

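Theorem 1. If $\Lambda$ is a {0, C}-solution to an SVM training problem $\mathcal{P}$,
then $w = 0$ in all primal optimal solutions.

Proof: Any optimal solution must, along with $\Lambda$, satisfy the KKT
conditions. Exploiting this, we see:

$$0 = D\Lambda \;\Longrightarrow\; 0 = \Lambda D\Lambda
= \sum_{i=1}^{\ell}\sum_{j=1}^{\ell}\lambda_i D_{ij}\lambda_j
= \sum_{i=1}^{\ell}\sum_{j=1}^{\ell}\lambda_i y_i\,(x_i\cdot x_j)\,y_j\lambda_j
= \Big(\sum_{i=1}^{\ell}\lambda_i y_i x_i\Big)\cdot\Big(\sum_{j=1}^{\ell}\lambda_j y_j x_j\Big)
= w\cdot w \;\Longrightarrow\; w = 0.$$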



This is a key result. It states that if our dual problem admits a {0,C}- 
solution, the “optimal separating hyperplane” is w = 0. In other words, it is of no 
value to construct a hyperplane at all, no matter how expensive misclassifications 
are, and the optimal classifier will classify all future data points using only the 
threshold b. Our data must be arranged in such a way that we may as well 
“de-metrize” our space by throwing away all information about where our data 
points are located, and classify all points identically. 

The converse of this statement is false: given an SVM training problem
$\mathcal{P}$ that admits a primal solution with $w = 0$, it is not necessarily the
case that all dual optimal solutions are {0, C}-solutions, nor even that a
{0, C}-solution necessarily exists, as the following example, constructed from
the first example by “splitting” a data point into two new points whose average
is one of the original points, shows:

    x          y
    (2,3)      1
    (2,2)     -1
    (1,1.5)    1
    (1,2.5)    1
    (1,3)     -1

$$D = \begin{pmatrix} 13 & -10 & 6.5 & 9.5 & -11 \\ -10 & 8 & -5 & -7 & 8 \\ 6.5 & -5 & 3.25 & 4.75 & -5.5 \\ 9.5 & -7 & 4.75 & 7.25 & -8.5 \\ -11 & 8 & -5.5 & -8.5 & 10 \end{pmatrix}$$

Again letting $C = 10$, the reader may verify that setting
$\Lambda = (10, 10, 5, 5, 10)$, $w = 0$, $b = 1$, $\Xi = (0, 2, 0, 0, 2)$ gives feasible
primal and dual solutions, both with objective value 40, which are therefore
optimal. With more effort, the reader may verify that
$\Lambda = (10, 10, 5, 5, 10)$ is the unique optimal solution to the dual problem,
and therefore no {0, C}-solution exists.
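The claim that this five-point problem has no {0, C}-solution can be explored by brute force, since there are only $2^5$ candidate vectors with every coefficient at a bound. The sketch below is an illustration with hypothetical helper names; it uses the identity $\Lambda D\Lambda = w\cdot w$ with $w$ given by equation (1), enumerates the dual-feasible corner vectors, and also scans a few thresholds $b$ for the primal value at $w = 0$ under the constraint $y_i b \ge 1 - \xi_i$.

```python
from itertools import product

X = [(2, 3), (2, 2), (1, 1.5), (1, 2.5), (1, 3)]
y = [1, -1, 1, 1, -1]
C = 10


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dual_value(lam):
    """Dual objective, using Lambda D Lambda = w . w with w from equation (1)."""
    w = [sum(l * t * x[d] for l, t, x in zip(lam, y, X)) for d in range(2)]
    return sum(lam) - 0.5 * dot(w, w)


def primal_value_w0(b):
    """Primal objective at w = 0 with the smallest feasible slacks."""
    return C * sum(max(0.0, 1 - t * b) for t in y)


# Best dual value over nonzero vectors in {0, C}^5 that satisfy Lambda . y = 0.
best_corner = max(dual_value(lam) for lam in product((0, C), repeat=len(X))
                  if any(lam) and dot(lam, y) == 0)
interior = dual_value((10, 10, 5, 5, 10))
best_b = min((-1.0, -0.5, 0.0, 0.5, 1.0), key=primal_value_w0)

if __name__ == "__main__":
    print(best_corner, interior, best_b, primal_value_w0(best_b))
```

Every corner vector falls short of the dual value reached by $(10, 10, 5, 5, 10)$, so none of them can be dual optimal. Among the thresholds scanned, the primal value at $w = 0$ matches that dual value only for the threshold that favours the class with more points.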

Although our initial motivation was to study problems with optimal solutions
having every dual coefficient $\lambda_i$ at bounds, we gain additional insight by
studying the following, broader class of problems.



Definition 2. An SVM training problem $\mathcal{P}$ is degenerate if there exists an
optimal primal solution to $\mathcal{P}$ in which $w = 0$.

By Theorem 1, any problem that admits a {0, C}-solution is degenerate. As
in the {0, C}-solution case, one can use the KKT conditions to easily show that
the degeneracy of an SVM training problem is independent of the particular
choice of the parameter $C$, and that $w = 0$ in all primal optimal solutions of a
degenerate training problem.

For degenerate SVM training problems, even though there is no optimal
separating hyperplane in the normal sense, we still call those data points that
contribute to the “expansion” $w = 0$ with $\lambda_i \ne 0$ support vectors. Given an
SVM training problem $\mathcal{P}$, define $K_i$ to be the index set of points in class
$i$, $i \in \{1, -1\}$.

Lemma 3. Given a degenerate SVM training problem $\mathcal{P}$, assume without loss
of generality that $|K_{-1}| \le |K_1|$. Then all points in class $-1$ are support
vectors; furthermore, $\lambda_i = C$ if $i \in K_{-1}$. Additionally, if
$|K_{-1}| = |K_1|$, the (unique) dual optimal solution is $\Lambda = C$, that is,
$\lambda_i = C$ for all $i$.

Proof: Because $w = 0$, the primal constraints reduce to:

$$y_i b \ge 1 - \xi_i.$$

If $|K_{-1}| < |K_1|$, the optimal value of $b$ is 1, and $\xi_i$ is positive for
$i \in K_{-1}$. Therefore, $\lambda_i = C$ for $i \in K_{-1}$ (by Equations (3) and (6)).

Assume $|K_{-1}| = |K_1|$. We may (optimally) choose $b$ anywhere in the range
$[-1, 1]$. If $b \le 0$, all points in class 1 have $\lambda_i = C$, and if $b \ge 0$,
all points in class $-1$ have $\lambda_i = C$. In either case, there are at least
$|K_{-1}|$ points in a single class satisfying $\lambda_i = C$. But equation (2) says
that the sum of the $\lambda_i$ for each class must be equal, and since no $\lambda_i$
may be greater than $C$, we conclude that every $\lambda_i$ is equal to $C$ in both
classes.

Finally, we derive conditions on the input data for a degenerate SVM training
problem $\mathcal{P}$.

Theorem 2. Given an SVM training problem $\mathcal{P}$, assume without loss of
generality that $|K_{-1}| \le |K_1|$. Then:

a. $\mathcal{P}$ is degenerate if and only if there exists a set of multipliers
$\Omega$ for the points in $K_1$ satisfying:

$$0 \le \omega_i \le 1, \qquad
\sum_{i\in K_{-1}} x_i = \sum_{i\in K_1}\omega_i x_i, \qquad
\sum_{i\in K_1}\omega_i = |K_{-1}|.$$

b. $\mathcal{P}$ admits a {0, C}-solution if and only if $\mathcal{P}$ is degenerate and
the $\omega_i$ in part (a) may all be chosen to be 0 or 1.

Proof:

(a, ⇒) Suppose $\mathcal{P}$ is degenerate. Consider a modification of $\mathcal{P}$ with
identical input data, but $C = 1$; this problem is also degenerate. All points in
class $-1$ are support vectors, and their associated $\lambda_i$ are at 1, by
Lemma 3. Letting $\Lambda$ be any dual optimal solution to the modified problem, we
see that letting $\omega_i = \lambda_i$ for $i \in K_1$ and applying Equations (1) and
(2) demonstrates the existence of the $\omega_i$.

(a, ⇐) Given $\omega_i$ satisfying the condition, we easily see, using the KKT
conditions, that $\lambda_i = C$ for $i \in K_{-1}$ and $\lambda_i = \omega_i C$ for
$i \in K_1$ induce a pair of optimal primal and dual solutions to $\mathcal{P}$ with
$w = 0$.

(b, ⇒) Given a {0, C}-solution, $w = 0$ in an associated primal solution by
Theorem 1, and setting $\omega_i = \lambda_i/C$ for $i \in K_1$ satisfies the
requirements on $\Omega$.

(b, ⇐) Let $\lambda_i = \omega_i C$ for $i \in K_1$, and apply the KKT conditions.
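Part (b) of Theorem 2 suggests a direct, exponential-time test for small problems: a {0, C}-solution exists exactly when some subset of the larger class, of the same size as the smaller class, has the same vector sum as the smaller class. The following sketch illustrates this; the function name is hypothetical, and the all-one-class case is excluded because Definition 1 requires $\Lambda \ne 0$.

```python
from itertools import combinations


def admits_zero_c_solution(X, y, tol=1e-9):
    """Brute-force check of Theorem 2(b) with all omega_i in {0, 1}."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == -1]
    small, large = (neg, pos) if len(neg) <= len(pos) else (pos, neg)
    if not small:
        return False  # only one class present: the trivial case, Lambda = 0
    dim = len(X[0])
    target = [sum(x[d] for x in small) for d in range(dim)]
    # omega_i in {0, 1} with sum |small| means: pick |small| points of the larger class.
    for subset in combinations(large, len(small)):
        total = [sum(x[d] for x in subset) for d in range(dim)]
        if all(abs(t - s) <= tol for t, s in zip(target, total)):
            return True
    return False


if __name__ == "__main__":
    square = [(2, 3), (2, 2), (1, 2), (1, 3)]
    split = [(2, 3), (2, 2), (1, 1.5), (1, 2.5), (1, 3)]
    print(admits_zero_c_solution(square, [1, -1, 1, -1]))
    print(admits_zero_c_solution(split, [1, -1, 1, 1, -1]))
```

The search enumerates all subsets of the required size, so its cost grows combinatorially with the class sizes; it is meant for checking small examples such as the two given in Section 2.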



3 The Degenerating Polyhedron 



Theorem 2 indicates that it is always possible to make an SVM training problem
degenerate by adding a single new data point. We now proceed to characterize
the set of individual points whose addition will make a given problem
degenerate. For the remainder of this section, we assume that
$|K_{-1}| \le |K_1|$, and we denote $\sum_{i\in K_{-1}} x_i$ by $V$, and $|K_{-1}|$ by $n$.

Suppose we choose, for each $i \in K_1$, an $\omega_i \in [0, 1]$, satisfying
$n - 1 \le \sum_{i\in K_1}\omega_i < n$. It is clear from the conditions of Theorem 2
that if we add a new data point

$$x_c = \frac{V - \sum_{i\in K_1}\omega_i x_i}{n - \sum_{i\in K_1}\omega_i} \qquad (8)$$

then the problem becomes degenerate, where the new point has a multiplier given
by $\omega_c = n - \sum_{i\in K_1}\omega_i$. It is also clear that all single points
whose addition would make the problem degenerate can be found in such a manner.
We denote the set of points so obtained by $X_D$.
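Equation (8) is easy to evaluate for a concrete choice of multipliers. The sketch below is an illustration only: `degenerating_point` and the sample data are assumptions, not taken from the paper. It returns the new point $x_c$ together with its multiplier $\omega_c$, after checking that the chosen $\omega_i$ lie in $[0, 1]$ and sum to a value in $[n-1, n)$.

```python
def degenerating_point(neg_points, pos_points, omega):
    """Evaluate equation (8): the point x_c and its multiplier omega_c.

    neg_points is the smaller class K_{-1}, pos_points the larger class K_1,
    omega the chosen multipliers for pos_points.
    """
    n = len(neg_points)
    total = sum(omega)
    if any(o < 0 or o > 1 for o in omega) or not (n - 1 <= total < n):
        raise ValueError("omega must lie in [0, 1] and sum to a value in [n - 1, n)")
    dim = len(neg_points[0])
    V = [sum(x[d] for x in neg_points) for d in range(dim)]
    weighted = [sum(o * x[d] for o, x in zip(omega, pos_points)) for d in range(dim)]
    omega_c = n - total
    x_c = [(V[d] - weighted[d]) / omega_c for d in range(dim)]
    return x_c, omega_c


if __name__ == "__main__":
    # Hypothetical data: class -1 on the x-axis, class 1 well above it.
    neg = [(0, 0), (2, 0)]
    pos = [(0, 2), (2, 2), (1, 3)]
    print(degenerating_point(neg, pos, [1.0, 0.5, 0.0]))
```

With the returned point appended to the larger class and $\omega_c$ appended to the multipliers, the conditions of Theorem 2(a) hold by construction: the weighted sum of the larger class equals $V$ and the multipliers sum to $n$.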

We introduce the following notation. For $k \le n$, we let $S_k$ denote the set
containing all possible sums of $k$ points in $K_1$. Given a point $s \in S_k$, we
define an indicator function $\chi_s : K_1 \to \{0, 1\}$ with the property
$\chi_s(x_i) = 1$ if and only if $x_i$ is one of the $k$ points of $K_1$ that were
summed to make $s$.

The region $X_D$ is in fact a polyhedron whose extreme points and extreme
rays are of the form $V - s$ for $s \in S_{n-1}$ and $s \in S_n$, respectively. More
specifically, we have the following theorem; the proof is not difficult, but it
is rather technical, and we defer it to Appendix A.

Theorem 3. Given a non-degenerate problem $\mathcal{P}$, consider the polyhedron

$$P_D = \Big\{ \sum_{s^p\in S_{n-1}}\lambda_{s^p}(V - s^p) + \sum_{s^r\in S_n}\alpha_{s^r}(V - s^r)
\;\Big|\; \lambda_{s^p},\,\alpha_{s^r} \ge 0,\ \sum_{s^p\in S_{n-1}}\lambda_{s^p} = 1 \Big\}.$$

Then $P_D = X_D$.

An example is shown in Figure 1. The dark region represents the set of those
points that, when added to class 1, will make the problem degenerate. This
set can be obtained following the construction in Appendix A.

On the one hand, the idea that the addition of a single data point can make
an SVM training problem degenerate seems to bode ill for the usefulness of the
method. Indeed, SVMs are in some sense not robust. This is a consequence of
the fact that because errors are penalized in the $L_1$ norm, a single outlier
can have arbitrarily large effects on the separating hyperplane. However, the
fact that we are able to precisely characterize the “degenerating” polyhedron
allows us to provide a positive result as well. We begin by noting that in the
example of Figure 1 the entire polyhedron of points whose addition makes the
problem degenerate is located well away from the initial data. This is not a
coincidence. Indeed, using Theorem 3 we may easily derive the following theorem:

Theorem 4. Given a non-degenerate problem $\mathcal{P}$ with $|K_{-1}| \le |K_1|$,
suppose there exists a hyperplane $w$ through $V/n$, the center of mass of
$K_{-1}$, such that all points in $K_1$ lie on one side of $w$, and the closest
distance between a point in $K_1$ and $w$ is $d$. Then all points in the
“degenerating” polyhedron $P_D$ lie at least $(|K_{-1}| - 1)\cdot d$ from $w$ on the
other side of $w$ from $K_1$.

Using Theorem 2 we can easily show that if the center of mass of the points
in the smaller class ($V/n$) does not lie in the convex hull of the points in
the larger class, our problem is not degenerate, and we may apply Theorem 4 to
bound below the distance at which an outlier would have to lie from $V/n$ in
order to make the problem degenerate. We conclude that if the class with larger
cardinality lies well away from and entirely to one side of a hyperplane through
the center of mass of the class of smaller cardinality, our problem is
nondegenerate, and any single point we could add to make the problem degenerate
would be an extreme outlier, lying on the opposite side of the smaller class
from the larger class.

4 Nonlinear SVMs and Further Remarks

The conditions we have derived so far apply to the construction of a linear
decision surface. It should be clear that similar arguments apply to nonlinear
kernels. In particular, degenerate SVMs will occur if and only if the data
satisfy the conditions of Theorem 2 after undergoing the nonlinear mapping to
the high-dimensional space. It is not necessary that the data be degenerate in
the original input space, although examples could be derived where they were
degenerate in both spaces, for a particular kernel choice. The important
message of Theorem 2, however, is that while degenerate SVMs are possible, the
requirements on the input data are so stringent that one should never expect to
encounter them in practice. On another note, if a degenerate SVM does occur,
one simply sets the threshold $b$ to 1 or $-1$, depending on which class
contributes more points to the training set. Thus in all cases, we are able to
determine the threshold $b$. Of course, the wisdom of this approach depends on
the data distribution. If our two classes lie largely on top of each other,
then classifying according to the larger class may indeed be the best we can do
(assuming our examples were drawn randomly from the input distribution). If,
instead, our dataset looks more like that of Figure 1, we are better off
removing outliers and re-solving.

Fig. 1. A sample problem, and the “degenerating” polyhedron: whenever a point
in the polyhedron is added to class 1 (circle), the problem has the degenerate
solution $w = b = 0$.

Finally, a brief remark on complexity is in order. The quadratic program
(D) can be solved in polynomial time, and solving this program will allow us to
determine whether a given SVM training problem $\mathcal{P}$ is degenerate. However,
the problem of determining whether or not a {0, C}-solution exists is not so
easy. Certainly, if $\mathcal{P}$ is not degenerate, no {0, C}-solution exists, but
the converse is false. Determining the existence of a {0, C}-solution may be
quite difficult: if we require the $x_i$ to lie in $\mathbb{R}^1$, determining whether
a {0, C}-solution exists is already equivalent to solving the weakly
NP-complete problem SUBSET-SUM (see [?] for more information on
NP-completeness).^3

^3 Because the problem is only weakly NP-complete, given a bound on the size of
the numbers involved, the problem is polynomially solvable.
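A Proof of Theorem 3

Theorem 5. Given a non-degenerate problem $\mathcal{P}$, consider the polyhedron

$$P_D = \Big\{ \sum_{s^p\in S_{n-1}}\lambda_{s^p}(V - s^p) + \sum_{s^r\in S_n}\alpha_{s^r}(V - s^r)
\;\Big|\; \lambda_{s^p},\,\alpha_{s^r} \ge 0,\ \sum_{s^p\in S_{n-1}}\lambda_{s^p} = 1 \Big\}.$$

Then $P_D = X_D$.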

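Proof:

(a, $P_D \subseteq X_D$) Given a set of $\lambda_{s^p}$ and $\alpha_{s^r}$ satisfying
$\lambda_{s^p}, \alpha_{s^r} \ge 0$ and $\sum_{s^p\in S_{n-1}}\lambda_{s^p} = 1$, we define
$A = \sum_{s^r\in S_n}\alpha_{s^r}$ and set

$$\omega_c = \frac{1}{1+A},$$

and, for $i \in K_1$, we set

$$\omega_i = \omega_c\Big(\sum_{s^p\in S_{n-1}}\lambda_{s^p}\chi_{s^p}(x_i) + \sum_{s^r\in S_n}\alpha_{s^r}\chi_{s^r}(x_i)\Big).$$

Then $0 \le \omega_i \le 1$ for each $i \in K_1$, and

$$\sum_{i\in K_1}\omega_i = \frac{n - 1 + nA}{1 + A} = n - \frac{1}{1+A},$$

which is in $[n-1, n)$, so we conclude that the assigned $\omega_i$ are valid.
Finally, substituting into Equation (8), we find:

$$\frac{V - \sum_{i\in K_1}\omega_i x_i}{n - \sum_{i\in K_1}\omega_i}
= \frac{V - \frac{1}{1+A}\sum_{i\in K_1}\Big(\sum_{s^p\in S_{n-1}}\lambda_{s^p}\chi_{s^p}(x_i) + \sum_{s^r\in S_n}\alpha_{s^r}\chi_{s^r}(x_i)\Big)x_i}{\frac{1}{1+A}}$$

$$= (1+A)V - \sum_{s^p\in S_{n-1}}\lambda_{s^p}s^p - \sum_{s^r\in S_n}\alpha_{s^r}s^r
= \sum_{s^p\in S_{n-1}}\lambda_{s^p}(V - s^p) + \sum_{s^r\in S_n}\alpha_{s^r}(V - s^r).$$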
spgS„-i spgS„ 
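We conclude that $P_D \subseteq X_D$.

(b, $X_D \subseteq P_D$) Our proof is by construction: given a set of $\omega_i$,
$i \in K_1$, we show how to choose $\lambda_{s^p}$ and $\alpha_{s^r}$ so that:

$$\lambda_{s^p} \ge 0 \quad \forall s^p \in S_{n-1}, \qquad
\sum_{s^p\in S_{n-1}}\lambda_{s^p} = 1, \qquad
\alpha_{s^r} \ge 0 \quad \forall s^r \in S_n,$$

$$\frac{V - \sum_{i\in K_1}\omega_i x_i}{n - \sum_{i\in K_1}\omega_i}
= \sum_{s^p\in S_{n-1}}\lambda_{s^p}(V - s^p) + \sum_{s^r\in S_n}\alpha_{s^r}(V - s^r).$$

If we impose the reasonable “separability” conditions:

$$\frac{V}{n - \sum_{i\in K_1}\omega_i} = \sum_{s^p\in S_{n-1}}\lambda_{s^p}V + \sum_{s^r\in S_n}\alpha_{s^r}V,$$

$$\frac{\sum_{i\in K_1}\omega_i x_i}{n - \sum_{i\in K_1}\omega_i} = \sum_{s^p\in S_{n-1}}\lambda_{s^p}s^p + \sum_{s^r\in S_n}\alpha_{s^r}s^r,$$

we can easily derive the following:

$$\sum_{s^r\in S_n}\alpha_{s^r} = \frac{\big(\sum_{i\in K_1}\omega_i + 1\big) - n}{n - \sum_{i\in K_1}\omega_i} = A.$$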

We are now ready to describe the actual construction. We will first assign the
$\alpha_{s^r}$, then the $\lambda_{s^p}$. We describe in detail the assignment of the
$\alpha_{s^r}$; the assignment of the $\lambda_{s^p}$ is essentially similar. We begin
by initializing each $\alpha_{s^r}$ to 0. At each step of the algorithm, we consider
the “residual”:

$$\frac{V - \sum_{i\in K_1}\omega_i x_i}{n - \sum_{i\in K_1}\omega_i} - \sum_{s^r\in S_n}\alpha_{s^r}(V - s^r) \qquad (9)$$



Note that by expanding each $s^r$ in the $n$ points of $K_1$ which sum to it, we
can represent (9) as a multiple of $V$ minus a linear combination of the points
of $K_1$; we will maintain the invariant that this linear combination is
actually a nonnegative combination. During a step of the algorithm, we select
the $n$ points of $K_1$ that have the largest coefficients in this expansion. If
there is a tie, we expand the set to include all points with coefficients equal
to the $n$th largest coefficient. Let $j$ be the number of points in the set that
share the $n$th largest coefficient, and let $k$ ($\ge n$) be the total size of the
selected set. We select the $\binom{j}{n-k+j}$ points $s^r$ containing the
remaining $\max(k - j, 0)$ points with the largest coefficients, and $n - k + j$ of
the $j$ points which share the $n$th largest coefficient. We will then add equal
amounts of each of these $s^r$ to our representation until some pair of
coefficients in the residual that were unequal become equal. This can happen in
one of two ways: either the smallest of the coefficients in our set can become
equal to a new, still smaller coefficient, or the second smallest coefficient
in the set can become equal to the smallest (this can only happen in the case
where $k > n$). At each step of the algorithm, the total number of different
coefficients in the residual is reduced by at least one, so, within $|K_1|$
steps, we will be able to assign all the $\alpha_{s^r}$ (note that at each step of
our algorithm, we increase $\binom{j}{n-k+j}$ of the $\alpha_{s^r}$). The only way
the algorithm could break down is if, at some step, there were fewer than $n$
points in $K_1$ with nonzero coefficients in the residual. Trivially, the
algorithm does not break down at the first step: there must always be at least
$n$ points with non-zero coefficients initially. To show that the algorithm does
not break down at a later step, assume that after assigning coefficients to the
$s^r$ totaling $k$ ($< A$), we are left with $j$ ($< n$) non-zero coefficients.
Noting that our algorithm requires that each of the $j$ remaining points with
non-zero coefficients is part of each $s^r$ with a non-zero coefficient, we can
see that the residual value of each of these $j$ points is no more than
$\frac{1}{n - \sum_{i\in K_1}\omega_i} - k$. We derive the following bound on the
initial sum of the coefficients, which we call $I_{sum}$:

$$I_{sum} \le j\Big(\frac{1}{n - \sum_{i\in K_1}\omega_i} - k\Big) + kn
= \frac{j}{n - \sum_{i\in K_1}\omega_i} + k(n - j)
\le \frac{n - 1}{n - \sum_{i\in K_1}\omega_i} + k$$

$$< \frac{n - 1}{n - \sum_{i\in K_1}\omega_i} + \frac{\sum_{i\in K_1}\omega_i + 1 - n}{n - \sum_{i\in K_1}\omega_i}
= \frac{\sum_{i\in K_1}\omega_i}{n - \sum_{i\in K_1}\omega_i}.$$

But this is a contradiction: $I_{sum}$ must be equal to
$\frac{\sum_{i\in K_1}\omega_i}{n - \sum_{i\in K_1}\omega_i}$. We conclude that we are able
to assign the $\alpha_{s^r}$ successfully. Extremely similar arguments hold for the
$\lambda_{s^p}$.
Abstract. The paper investigates whether it is possible to learn every
enumerable class of recursive functions from “typical” examples. “Typical”
means that there is a computable family of finite sets, such that for each
function in the class there is one set of examples that can be used in any
suitable hypothesis space for this class of functions. As it will turn out,
there are enumerable classes of recursive functions that are not learnable
from “typical” examples. The learnable classes are characterized.

The results are proved within an abstract model of learning from examples,
introduced by Freivalds, Kinber and Wiehagen. Finally, the results are
interpreted and possible connections of this theoretical work to the
situation in real life classrooms are pointed out.



1 Introduction 

This work started with the following question. Assume a teacher has to teach 
five pupils the concepts from a given concept class. Is it always possible to come 
up with one finite set of examples for each concept, such that all pupils will learn 
the intended concept when given the corresponding set of examples? 

This seems to be a very natural question; in fact, every teacher will ask
herself which examples she should present to her pupils so that all of them
learn, for example, arithmetic.

In inductive inference, a pupil is mostly represented by a learning machine
M and a hypothesis space $\varphi$, which is usually some numbering. The machine
is fed examples of some unknown concept - which has to be represented in
$\varphi$ - and outputs one or more hypotheses, which are interpreted with
respect to $\varphi$. The machine has learned the concept successfully if its
last hypothesis is a correct description of what it had to learn.

Some of the results in learning theory depend fundamentally on the choice 
of an appropriate hypothesis space; see for example !oi, m This problem of 
finding a suitable hypothesis space for a target class could be avoided, if there 
were examples so “typical” for the concepts in the target class, that they would 
suffice to learn in any hypothesis space. 

In this paper, we will investigate this problem for the relatively manageable
case of enumerable sets of recursive functions. The learning model used was
introduced by Freivalds, Kinber and Wiehagen; cf. [2]. One of their intentions
was to model the teacher-pupil scenario encountered in real-life classrooms;
more on this in the next section. Even in this rather easy case, there are
enumerable classes of recursive functions that do not have “typical” examples.
The paper gives a recursion theoretic characterization of the enumerable
classes learnable from “typical” examples. It would be nice, though, to have a
deeper understanding of which classes fall within this characterization.

The next section gives formal definitions, some basic results and further mo- 
tivation regarding the models used for this investigation. In section three we 
give the main results. Due to lack of space, only a few proofs could be included 
entirely in the main part of the paper, the rest is sketched. The last section is 
devoted to conclusions, open problems and an attempt to give connections of 
this theoretical work to real life teaching. It might be argued that this has no
place in work on machine learning. But the author is convinced that, if machine
learning can give insights into how humans learn or into problems they might
face while learning, the opportunity to discuss these insights should not be
passed up.

2 Definitions, Notations, and Basic Results 

In the following, familiarity with standard mathematical and recursion theoretic 
notation and concepts, as given for example in P33, is assumed. 

$\mathcal{P}^n$ will denote the set of partial recursive functions of n arguments;
by definition $\mathcal{P} = \mathcal{P}^1$. $\mathcal{R}$ will stand for the set of recursive
functions. Sometimes, functions are identified with their graphs; for example
“$020^\infty$” may stand for the function that is everywhere zero with the
exception of argument 1, where it assumes value 2. For a function f,
$f(x)\downarrow$ means that f is defined on argument x. A function f is an
initial segment of a function $g \in \mathcal{R}$ if domain(f) = {0, ..., n} for
some n and furthermore $f \subseteq g$. For any recursive f, let $f^n$ stand for
a standard encoding of the initial segment of f with length n. This will later
ease the definition of the learning machines, which can then be thought of as
machines with a fixed number of input parameters.

All functions in $\mathcal{P}^2$ are called numberings. Let $\varphi$ and $\psi$ range
over numberings and let $\eta$ be any fixed standard acceptable numbering;
cf. [?]. A numbering $\varphi$ has decidable equality if the set
$\{(i,j) \mid \varphi_i = \varphi_j\}$ is recursive. A numbering is called one-one
if every function appears at most once in it. A numbering $\varphi$ is said to be
reducible to a numbering $\psi$, written $\varphi \le \psi$, if there exists
$r \in \mathcal{R}$ such that $\varphi_i = \psi_{r(i)}$ for all i. The function r is
then called a reduction. Two numberings $\varphi$ and $\psi$ are called
equivalent, written $\varphi \equiv \psi$, if both $\varphi \le \psi$ and
$\psi \le \varphi$ hold.

Define $\mathcal{P}_\varphi = \{f \mid \text{there exists } i \text{ such that } \varphi_i = f\}$.
If $\mathcal{P}_\varphi \subseteq \mathcal{P}_\psi$, then $\varphi$ is a subnumbering of $\psi$.

Let U range over subsets of $\mathcal{R}$. U is said to be enumerable if there is a
numbering $\varphi$ satisfying $\mathcal{P}_\varphi = U$.

Let $\varphi$ be any numbering. A numbering ex contains good examples for
$\varphi$ if the following conditions are fulfilled: (1) $ex_i \subseteq \varphi_i$
for all i, and (2) there is $e \in \mathcal{R}$ such that
$e(i) = \mathrm{card}(\mathrm{domain}(ex_i))$ for all i.
In particular, part (2) implies that $\mathrm{domain}(ex_i)$ is finite and
recursive. Therefore, the set of good examples
$\{(x, ex_i(x)) \mid x \in \mathrm{domain}(ex_i)\}$ can easily be computed from i,
i.e. there is an effective algorithm that computes on input i the good
examples given by $ex_i$, returns them and stops; cf. [?]. We will say “ex are
good examples for $\varphi$” synonymously for “there is a numbering ex
containing good examples for $\varphi$”.

An inference machine is any computable device that takes given, finite sets 
of examples to natural numbers, which will be interpreted as programs with 
respect to some previously selected numbering. 

Definition 1. (See [2].) U is called learnable from good examples with respect
to $\varphi$ (written $U \in \text{Gex-Fin}_\varphi$) if there exist good examples ex
for $\varphi$ and an inference machine M, such that for all i such that
$\varphi_i \in U$ and all finite A such that $ex_i \subseteq A \subseteq \varphi_i$
the following holds: $M(A)\downarrow$ and $\varphi_{M(A)} = \varphi_i$.

Let $\text{Gex-Fin} = \{U \mid U \in \text{Gex-Fin}_\varphi \text{ for some } \varphi\}$.

Note that we require only functions in the target class U to be identified.
Furthermore, we require M to learn from any finite superset of the good
examples, as long as this set is contained in the target function. This is
done in order to avoid some coding tricks, like presenting the only example
$(i, \varphi_i(i))$ in order to learn $\varphi_i$, which would have nothing in
common with the learning problem we would like to model.

Other models consider learning to be a limiting process. 

Definition 2. (See [?].) An inference machine M is said to identify a recursive
function f with respect to $\varphi$ in the limit if, for all n,
$M(f^n)\downarrow = i_n$ and there exists an $n_0$ such that $i_m = i_{n_0}$ for
all $m > n_0$, and furthermore $\varphi_{i_{n_0}} = f$.

In other words, an inference machine learning some f can change its mind
about a correct description for f a finite number of times, but must
eventually converge to an index for f in $\varphi$.
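
The limiting behaviour described in Definition 2 can be illustrated by a small Python sketch of learning by enumeration. It is only a toy under stated assumptions: every function is taken to be 0 almost everywhere and is stored as a dictionary of its non-zero values, the numbering is a finite list, and the names `value`, `initial_segment`, `guess` and `hypotheses` are hypothetical helpers that do not appear in the paper.

```python
from typing import Dict, List

# Toy assumption: a function is 0 everywhere except at the listed arguments.
Func = Dict[int, int]


def value(f: Func, x: int) -> int:
    return f.get(x, 0)


def initial_segment(f: Func, n: int) -> List[int]:
    """The values f(0), ..., f(n)."""
    return [value(f, x) for x in range(n + 1)]


def guess(numbering: List[Func], segment: List[int]) -> int:
    """Least index whose function agrees with the segment seen so far."""
    for j, g in enumerate(numbering):
        if all(value(g, x) == y for x, y in enumerate(segment)):
            return j
    raise ValueError("target is not in the numbering")


def hypotheses(numbering: List[Func], target: Func, steps: int) -> List[int]:
    """The sequence i_0, i_1, ... of guesses on growing initial segments."""
    return [guess(numbering, initial_segment(target, n)) for n in range(steps)]


numbering = [{}, {2: 1}, {5: 1}]
print(hypotheses(numbering, {5: 1}, 8))  # [0, 0, 0, 0, 0, 2, 2, 2]
```

The learner changes its mind once, at the first argument where the target differs from the zero function, and then stays with a correct index.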

Definition 3. U is learnable in the limit with respect to $\varphi$ (written
$U \in EX_\varphi$) if there is an inference machine that identifies every
$f \in U$ with respect to $\varphi$.

Define $EX = \{U \mid U \in EX_\varphi \text{ for some } \varphi\}$.

There are two major differences between these two definitions: (1) a “good- 
example-machine” receives one finite set of examples, whereas a “Gold-machine” 
over time will see every value of the target function; and (2): a good-example- 
machine is only allowed one guess, whereas a Gold-machine can change its guess 
finitely many times. 

So it would be natural to expect the relation $\text{Gex-Fin} \subset EX$. But
surprisingly, Freivalds, Kinber and Wiehagen proved in [2]:

Theorem 1. $EX \subset \text{Gex-Fin}$.

Beside the formal proof, there is an intuitive argument supporting this result: 
when the good-example-machine gets its examples, which are computed from 
a program of the function to be learned, it knows that these examples are the
important part of the target function and can concentrate its efforts on processing 
this important information. On the other hand, the Gold-machine might not 
be able to distinguish the important part of what it sees from unimportant 
information and therefore cannot learn a function it is presented, even though 
it may change its mind a finite number of times. 

Furthermore, the result states, that for every function in every EX-learnable 
class, there is such an important part of that function which suffices to identify 
it with respect to the other functions in U. 

Theorem 1 is one reason the Gex-model was used for this study, since it
contains everything that can be learned in the limit if all examples are
demonstrated. The other reason is that the Gex-model seems to model the usual
pupil-teacher situation pretty well, which is interesting in its own right. To
see this, imagine the good examples
$ex_i = \{(x, ex_i(x)) \mid x \in \mathrm{domain}(ex_i)\}$ to be computed by a
teacher, who wants a pupil $(M, \varphi)$ - i.e., the “learning algorithm” M the
pupil knows, together with the pupil's “knowledge base” $\varphi$ - to learn
concept $\varphi_i$. The pupil is then given a superset of those examples by the
teacher, processes them, and hopefully comes up with a representation of the
intended concept. We will follow these thoughts some more in the conclusions.

Suppose we want to learn an enumerable class U . It is rather easy to see 
that there exist infinitely many enumerations for any such U . Furthermore, the 
following holds: 

Proposition 1. Let U be any enumerable class and $\varphi$ any numbering
satisfying $\mathcal{P}_\varphi = U$. Then $U \in \text{Gex-Fin}_\varphi$ iff
$\{(i,j) \mid \varphi_i = \varphi_j\}$ is recursive.

Proof. (Sketch) “$\Rightarrow$” Assume a) $U \in \text{Gex-Fin}_\varphi$, where
b) $\varphi \in \mathcal{R}^2$. Then a) is easily seen to imply $\varphi_i = \varphi_j$
iff $ex_i \subseteq \varphi_j$ and $ex_j \subseteq \varphi_i$, while b) together with
the properties of good examples implies that the latter test is recursive.

“$\Leftarrow$” If equality in $\varphi$ is decidable, it is very easy to compute
good examples, since $\varphi_i \neq \varphi_j$ yields the existence of an x such
that $\varphi_i(x)\downarrow \neq \varphi_j(x)\downarrow$. □
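
The “if” direction can be made concrete with a short Python sketch. It is one possible way to pick good examples, not necessarily the construction the author has in mind: for index i it records, for every earlier index j whose function differs, one argument on which the two functions disagree, and the learner returns the least index consistent with its input. The toy representation (functions that are 0 almost everywhere, stored as dictionaries, in a finite list) and all helper names are assumptions made for illustration.

```python
from typing import Dict, List

# Toy assumption: a function is 0 everywhere except at the listed arguments,
# so equality of two functions is trivially decidable.
Func = Dict[int, int]


def value(f: Func, x: int) -> int:
    return f.get(x, 0)


def equal(f: Func, g: Func) -> bool:
    return all(value(f, x) == value(g, x) for x in set(f) | set(g))


def first_difference(f: Func, g: Func) -> int:
    """Least argument on which f and g disagree (they must be different)."""
    for x in sorted(set(f) | set(g)):
        if value(f, x) != value(g, x):
            return x
    raise ValueError("functions are equal")


def good_examples(numbering: List[Func], i: int) -> Func:
    """Examples separating numbering[i] from every different earlier function."""
    f = numbering[i]
    ex: Func = {}
    for j in range(i):
        if not equal(numbering[j], f):
            x = first_difference(numbering[j], f)
            ex[x] = value(f, x)
    return ex


def learn(numbering: List[Func], examples: Func) -> int:
    """Inference machine: least index whose function contains all examples."""
    for j, g in enumerate(numbering):
        if all(value(g, x) == y for x, y in examples.items()):
            return j
    raise ValueError("no consistent function in the numbering")


numbering = [{}, {1: 2}, {3: 1}, {1: 2}]   # index 3 repeats index 1
ex = good_examples(numbering, 3)
print(ex, learn(numbering, ex))
```

Any index j below i that survives the consistency test must denote the same function as i, because the examples for i contain a disagreement with every different earlier function. This also explains why supersets of the examples, as long as they are contained in the target, do no harm.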

Let us now consider an arbitrary enumerable class U of recursive functions. 
The following is a well known result from recursion theory; see for references. 

Proposition 2. For every enumerable class U there exists a numbering $\varphi$
such that $\mathcal{P}_\varphi = U$ and $\{(i,j) \mid \varphi_i = \varphi_j\}$ is recursive.

So, by Proposition 1, we get that for any enumerable class U there are
inference machines able to learn U in the Gex-Fin sense with respect to
$\varphi$. Now it is easy to see that there are infinitely many numberings that
enumerate U and have recursive equality.

Let / be any function from U. Is there a set of examples so typical for /, 
such that a teacher could present those examples to any machine able to learn 
U - and hence / - and it comes up with a correct description for /? 

The results in the next section will show, that there exist classes of recursive 
functions that do not have “typical” examples. 
3 Results

Definition 4. Let U be an enumerable class of recursive functions.

$\mathrm{Hyp}(U) = \{\varphi \mid U \in \text{Gex-Fin}_\varphi\}$

The abbreviation Hyp is meant as a reminder that this set contains all
suitable hypothesis spaces for the class U. Now Propositions 1 and 2 imply
that $\mathrm{Hyp}(U) \neq \emptyset$ for all enumerable classes U.

Next we define what it means for an enumerable class of functions to be 
learnable with typical examples. 

Definition 5. Let U be an enumerable class of recursive functions. We say
$U \in \text{Gex-Fin}$ with typical examples, if there exist
$\varphi \in \mathrm{Hyp}(U)$ and good examples ex for U with respect to $\varphi$
such that $U \in \text{Gex-Fin}_\varphi$ with the good examples given by ex, and
for all $\psi \in \mathrm{Hyp}(U)$ there exist good examples ex' such that
$U \in \text{Gex-Fin}_\psi$ with examples ex' and furthermore, for all i, j, we
have that $\varphi_i = \psi_j$ implies $ex_i = ex'_j$.

In other words, U is learnable from typical examples if there is a way to
choose the examples for any $f \in U$ to be equal in all suitable hypothesis
spaces. This seems to capture our intention pretty well, since a teacher can
now present this set of examples to all inference machines and all of them
will learn f. So, the examples are so “typical” for f with respect to the
other functions in U that every machine able to learn U can do so when given
the “typical” examples.

Note that this definition implies that the typical examples are uniformly
computable for each admissible hypothesis space.

The first result in this section characterizes the enumerable sets of recursive 
functions that may be learned with typical examples. 

Theorem 2. Let U be an enumerable class of recursive functions.
$U \in \text{Gex-Fin}$ with typical examples iff $\varphi \equiv \psi$ holds for
all $\varphi, \psi \in \mathrm{Hyp}(U)$.

Proof. “$\Rightarrow$” The reduction r to be defined takes $\varphi$-indices to
$\psi$-indices by just searching for a function with the same set of good
examples, which exists by assumption and definition of typical examples.

“$\Leftarrow$” From Proposition 1 we get that the set $\mathrm{Hyp}(U)$ contains
every one-one numbering of U, since for those numberings equality is obviously
decidable. Let $\varphi$ be any one-one numbering of U. Proposition 1 implies
$U \in \text{Gex-Fin}_\varphi$; let ex be the good examples witnessing this. Pick
any $\psi \in \mathrm{Hyp}(U)$. Then, by assumption, $\psi \le \varphi$ via some
$r \in \mathcal{R}$. Define $ex'_i = ex_{r(i)}$ for all i. Since ex are good examples
for $\varphi$ and $\varphi_{r(i)} = \psi_i$ for all i, ex' contains good examples
for $\psi$. Furthermore, if $\psi_i = \psi_j$ then $r(i) = r(j)$, because
$\varphi$ is one-one, and therefore $ex'_i = ex'_j$. This yields that equality in
$\psi$ is decidable and, applying Proposition 1, we get $U \in \text{Gex-Fin}$
with typical examples. □
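
The transfer step $ex'_i = ex_{r(i)}$ used in the second half of the proof can be sketched in Python on a finite toy instance. The representation (functions that are 0 almost everywhere, stored as dictionaries of their non-zero values), the hand-picked example sets and the helper names `normalise`, `reduction` and `transfer` are assumptions made purely for illustration.

```python
from typing import Dict, List

Func = Dict[int, int]  # toy assumption: 0 everywhere except listed arguments


def normalise(f: Func) -> Func:
    """Canonical form, so that equal functions compare equal as dictionaries."""
    return {x: y for x, y in f.items() if y != 0}


def reduction(psi: List[Func], phi: List[Func]) -> List[int]:
    """r with psi[i] == phi[r[i]]; found by search since equality is decidable."""
    keys = [normalise(f) for f in phi]
    return [keys.index(normalise(g)) for g in psi]


def transfer(ex: List[Func], r: List[int]) -> List[Func]:
    """ex'_i = ex_{r(i)}: reuse the examples chosen for the one-one numbering."""
    return [ex[r_i] for r_i in r]


phi = [{}, {1: 2}, {3: 1}]               # one-one numbering of the toy class
psi = [{3: 1}, {}, {1: 2}, {3: 1}]       # same class, with a repetition
ex = [{1: 0, 3: 0}, {1: 2}, {3: 1}]      # good examples for phi, chosen by hand
r = reduction(psi, phi)
ex_prime = transfer(ex, r)
print(r, ex_prime)
```

Because `phi` is one-one, equal functions in `psi` are sent to the same index of `phi` and therefore receive the same example set, which is exactly the property required of typical examples.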

Proofs for the following theorem are already known, see 0 and the references 
therein for a short survey; especially |2j. A new proof is given here that allows 
the observations we need in order to make our point concerning the existence of 
typical examples. 
Theorem 3. There exists an enumerable class of recursive functions that has
one-one numberings $\varphi$ and $\psi$ such that $\varphi \not\le \psi$.

The two constructed numberings being one-one implies
$\{\varphi, \psi\} \subseteq \mathrm{Hyp}(\mathcal{P}_\varphi)$. But now Theorem 2 implies
that $\mathcal{P}_\varphi$ is not learnable with typical examples.

Proof. We will construct two numberings $\varphi$ and $\psi$ in parallel and
diagonalize against all functions possibly witnessing $\varphi \le \psi$. Recall
that $\eta$ is a numbering of all partial recursive functions.

Initialize: For all i, set $\varphi_i := \psi_i := \emptyset$ and $\ell_i := 0$. Set $D := \emptyset$ and $n := 0$.

Step s: Set $\ell_n := s$, $\varphi_n := \psi_n := 1^s0$, $D := D \cup \{n\}$, $n := n + 1$.

For $i = 1$ to $n - 1$, $i \in D$ do:

Compute $j = \eta_{\ell_i}(i)$ for at most s steps.

(1) If j is still undefined, then $\varphi_i := \varphi_i 0$, i.e. extend $\varphi_i$ with another
zero.

(2) If $j = i$, then

set $\psi_i := \psi_i 1^\infty$ and terminate $\varphi_i$,
set $\varphi_n := \psi_i$, $\psi_n := \varphi_i$, $n := n + 1$.

(3) If $j \neq i$, then $\varphi_i := \varphi_i 0^\infty$ and terminate $\psi_i$.
next i

(a) First note that $\varphi$ and $\psi$ are computable, everywhere defined and
$\mathcal{P}_\varphi = \mathcal{P}_\psi$; this follows immediately from the construction.

(b) Furthermore, $\varphi$ and $\psi$ are one-one: the functions in both
numberings take as values only 0 and 1. For every s, at most two functions
start with $1^s0$. Furthermore, every function begins with $1^s0$ for some
suitable s. If there are two such functions, one continues with $0^\infty$ and
the other one is terminated by ending it with $1^\infty$; cf. (2) in the
construction above.

(c) Finally, $\{j \mid \text{there is } i \text{ such that } \ell_i = j\} = \mathbb{N}$:
obviously, in every step s some $\ell_n$ will assume the value s.

So we have - by (a) and (b) - that $\mathcal{P}_\varphi = \mathcal{P}_\psi$ is an enumerable
set of functions and both $\varphi$ and $\psi$ are one-one. It remains to prove
that $\varphi \not\le \psi$.

Assume, by way of contradiction, that there is a recursive function $\eta_c$
satisfying $\varphi_n = \psi_{\eta_c(n)}$ for all n. By (c), there exists i such
that $\ell_i = c$. Since $\eta_c$ is recursive, $\eta_c(i)\downarrow = j$ for
some j. There are two possible cases:

$i = j$: Then $\psi_i = 1^c0^k1^\infty$ for some suitable k, since the
construction will be terminated in (2). But $\varphi_i = 1^c0^\infty$ and
therefore $\varphi_i \neq \psi_i$. Contradiction.

$i \neq j$: In this case $\varphi_i = \psi_i = 1^c0^\infty$ by (3) of the
construction. Since $\psi$ is one-one, $\psi_j \neq \psi_i$ follows.
Contradiction.

So $\varphi \not\le \psi$, and hence the theorem follows. □

Corollary 1. Let $\varphi$ and $\psi$ be the numberings constructed in the proof
of Theorem 3. Assume $\mathcal{P}_\varphi \in \text{Gex-Fin}_\varphi$ with good examples ex
and $\mathcal{P}_\psi \in \text{Gex-Fin}_\psi$ with good examples ex'. Then there exist
infinitely many i such that

(1) $ex_i \subseteq \psi_i$ and $ex'_i \subseteq \varphi_i$,

(2) $\varphi_i \neq \psi_i$.
Proof. Due to space restrictions we only give an informal argument. If
$\varphi_i \neq \psi_i$, then there exists j such that $\varphi_i = \psi_j$ and
$\varphi_j = \psi_i$. The set $\{i \mid \varphi_i = \psi_i\}$ is not recursive, and
therefore the question whether there is a j as mentioned above is not
decidable either. The functions $\varphi_i$ and $\psi_i$ have the same beginning
and, since it is not known if they will differ later, the good examples - once
for $\varphi$ and once for $\psi$ - have to be selected from this common part. If
the two functions differ, i.e. if there exists j as indicated above, then the
good examples $ex_j$ ($ex'_j$, resp.) can be selected so as to contain a
difference between $\varphi_i$ and $\varphi_j$ ($\psi_i$ and $\psi_j$, resp.). So,
$\varphi_i$ and $\psi_i$ are not equal, but the examples were picked from the
common part. This yields the assertion. □

Let us interpret this result. Pick any i satisfying conditions (1) and (2) of
the corollary. Note that (1) implies $ex_i \cup ex'_i \subseteq \varphi_i$ and
$ex_i \cup ex'_i \subseteq \psi_i$. Hence, $ex_i \cup ex'_i$ is an admissible
input for any inference machine $M_1$ witnessing
$\mathcal{P}_\varphi \in \text{Gex-Fin}_\varphi$ as well as for any inference machine $M_2$
witnessing $\mathcal{P}_\psi \in \text{Gex-Fin}_\psi$. Since both numberings are one-one
and by definition of the inference process, we get
$\varphi_{M_1(ex_i \cup ex'_i)} = \varphi_i$ and
$\psi_{M_2(ex_i \cup ex'_i)} = \psi_i$. So, both machines learn a function from
the input $ex_i \cup ex'_i$, but since $\varphi_i \neq \psi_i$, they have learned
different functions from the same set of examples. The corollary states that
this will happen for infinitely many concepts in $\mathcal{P}_\varphi$, regardless of
how the teacher selects the examples for the machines $M_1$ and $M_2$ and their
respective hypothesis spaces $\varphi$ and $\psi$.

Now we will prove the theorems for the general case. Again we stress that
the fact stated in Theorem 4 is already known. On the other hand, the proof
given here had to be made from scratch in order to guarantee the properties we
need to formulate and prove Corollary 2.

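Theorem 4. For each $2 \le a \in \mathbb{N}$ there exist numberings
$\varphi^1, \ldots, \varphi^a$ with the following properties for all
$0 < i, j \le a$:

(1) $\varphi^i \in \mathcal{R}^2$ and $\varphi^i$ is one-one,

(2) $\mathcal{P}_{\varphi^i} = \mathcal{P}_{\varphi^j}$ and

(3) $i \neq j$ implies $\varphi^i \not\le \varphi^j$.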

Proof. Let $a \ge 2$ be given. We will construct the numberings
$\varphi^1, \ldots, \varphi^a$ in parallel by diagonalizing against all possible
$\binom{a}{2}$-tuples of reductions among the $\varphi^i$. There are easier ways to
prove part (3) of the theorem, but this will enable us to prove an analogue to
Corollary 1.

The construction below uses lists. Let [] denote the empty list. For some
non-empty list L = [a, b, c, d], as you would expect, Head(L) = a and
Tail(L) = [b, c, d].

Now we will give a short explanation of the variables used in the construction 
and hope this will increase readability a little bit. 

In $\ell_i$ we code the reductions we will diagonalize against with the set of
functions beginning with $1^{\ell_i}0$. Of course we could get this information
by scanning the beginning part of each function, but this keeps notation
easier. In $L_i$ we keep a list of reductions, in the order we want to
diagonalize against them. The variable $m_i$ will denote the biggest value for
which all the functions starting with $1^{\ell_i}0$ have already been defined.
If we write $c_{xy}$, we will mean reduction $\eta_{c_{xy}}$, which is supposed to
reduce $\varphi^x$ to $\varphi^y$. Finally, we use $b_x$ and $b_y$ to store the index
of the “biggest” function beginning with $1^{\ell_i}0$ in $\varphi^x$ and
$\varphi^y$. By “biggest” we mean the function f with the largest
$x \in \mathrm{domain}(f)$ such that f(x) = 1 up to the arguments defined until
now. Note that it will be possible to compute those, since we have knowledge
of $m_i$.
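
The order in which a list of reductions is arranged at the start of each step of the construction (neighbouring pairs $c_{x,x+1}$ first, all remaining pairs afterwards) can be reproduced with a few lines of Python. This is only an illustrative sketch: pairs (x, y) stand in for the indices $c_{xy}$, and the function name `diagonalization_order` is hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def diagonalization_order(a: int) -> List[Tuple[int, int]]:
    """Pairs (x, y), 1 <= x < y <= a: neighbours (x, x+1) first, then the rest."""
    # All pairs in the order c12, c13, ..., c1a, c23, ..., c(a-1)a.
    all_pairs = list(combinations(range(1, a + 1), 2))
    first_part = [(x, x + 1) for x in range(1, a)]
    rest = [pair for pair in all_pairs if pair not in first_part]
    return first_part + rest


print(diagonalization_order(5))
# [(1, 2), (2, 3), (3, 4), (4, 5), (1, 3), (1, 4), (1, 5), (2, 4), (2, 5), (3, 5)]
```

For a = 5 this yields the same order as the example list given in the construction.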

Initialize: $\varphi^j_i := \emptyset$ for all $j \in \{1, \ldots, a\}$, $i \ge 0$.

$\ell_i := m_i := n := 0$ and $L_i = []$ for all $i \ge 0$.

Step s: $\ell_n := s = \langle c_{12}, c_{13}, \ldots, c_{1a}, c_{23}, \ldots, c_{a-1\,a} \rangle$

$L_n = [c_{12}, c_{23}, c_{34}, \ldots, c_{a-1\,a}, \text{“rest”}]$, where “rest” is the list resulting from
erasing $c_{12}, c_{23}, c_{34}, \ldots, c_{a-1\,a}$ from the list $[c_{12}, c_{13}, \ldots, c_{1a}, c_{23}, \ldots, c_{a-1\,a}]$
[* For example $L_n = [c_{12}, c_{23}, c_{34}, c_{45}, c_{13}, c_{14}, c_{15}, c_{24}, c_{25}, c_{35}]$ *]

for j = 1 to a do: $\varphi^j_n := 1^{\ell_n}0$; end for;

n := n + 1;

for z = 1 to n — 1, fi yf 0 do: 

if Li = [] then [* all reductions have been taken care of *]

for all pi, j G {1, . . . , a}, z < n, that begin with define pi{mi + 
1):=0. 
end for 

nil '■= mi + 1. 

else [* there still are reductions we need to diagonalize against *] 
c := Head(Li) = Cxy 

if X + 1 = y then [* still in the “first part” of the diagonalization *] 
compute r := rjc(j) for at most s steps, 
if r = i then 

[* so we have pf = p^ = 1 ^'OXO, where ‘X’ is some suitable 
sequence of zeros and ones of length rm — £i — 2 *] 
pf := l^toXOOO; 

pii := l^toXOlO for all 1 < j < x; 
pi := l^toXOlO for all y < j < a; 
pi^ := l^toXOOO for all y < j < a; 

for all pi, j G { 1 , ... ,a}, z < n, that start with l^’O and have 
not yet been defined for arguments nii + 1 and mi + 2, let 
piimi + 1) := pi{mi + 2) := 0. 
end for 

n := n + 1 , mi := mi + 2 ; Li := Tail{Li). 
else [* i.e., x + 1 < y, the “second part” of the diagonalization *] 
bx := index of “biggest function” beginning with l^'O in p“^ . 
by := index of p^^ in p^ . 
compute r := rjcibx) for at most s steps, 
if r = by then 

[* again pf = p£ = 1 ^'OXO; see above *] 

Pi := 1 ^- 0 X 000 , • 
pI^ := 1 ^* 0 X 010 ; 

Pi := 1 ^* 0 X 010 for all j G {!,..., a} \ {y}; 
ipl^ := 1^- 0X000;

for all (fi, j G a}, z < n, that start with 1 ^* 0 , and 

have not yet been defined for arguments mi + 1 and mi + 2, 
let (fii{mi + 1 ) := + 2 ) := 0 . 

end for 

n := n + 1, mi := mi + 2; Li := Tail{Li). 
if the computation of rjdi), or resp. rjc{bx) does not terminate, then 

for all (fi, j G { 1 , . . . , a}, z <n, that egin with 1 ^* 0 , define ip{{mi + 
1 ):= 0 . 
end for 

mi := mi + 1 . 

end for 

(a) Note that $\varphi^i \in \mathcal{R}^2$ and $\mathcal{P}_{\varphi^i} = \mathcal{P}_{\varphi^j}$ for all
$i, j \in \{1, \ldots, a\}$ follows immediately from the construction.

(b) Every $\varphi^i$ is one-one. This can be argued in the same way as in the
proof of Theorem 3.

(c) For any $\binom{a}{2}$-tuple
$\langle c_{12}, c_{13}, \ldots, c_{1a}, c_{23}, \ldots, c_{a-1\,a} \rangle$ of indices of
recursive functions, $\eta_{c_{xy}}$ does not reduce $\varphi^x$ to $\varphi^y$. Again,
this can easily be seen, since the construction takes care that every possible
reduction is wrong.

It remains to prove that $i \neq j$ implies $\varphi^i \not\le \varphi^j$ for
$i, j \in \{1, \ldots, a\}$. Assume i < j; otherwise just exchange i and j.
Suppose, by way of contradiction, that $\eta_k$ reduces $\varphi^i$ to $\varphi^j$.
Hence k is an index of a recursive function. Let $\langle k, \ldots, k \rangle$ be
an $\binom{a}{2}$-tuple. Obviously, it only contains indices of recursive
functions. By (c) we have that, for all indices $c_{xy}$ in this tuple,
$\eta_{c_{xy}}$ does not reduce $\varphi^x$ to $\varphi^y$. This contradicts our choice
of k. The theorem follows. □



Corollary 2. Let $a \ge 2$ and $\varphi^1, \ldots, \varphi^a$ be the numberings
constructed in the proof of Theorem 4. Assume
$\mathcal{P}_{\varphi^i} \in \text{Gex-Fin}_{\varphi^i}$ with good examples $ex^i$ for all
$i \in \{1, \ldots, a\}$. Then there exist infinitely many n such that

(1) $ex^i_n \subseteq \varphi^j_n$ for all $i, j \in \{1, \ldots, a\}$ and

(2) $\varphi^i_n \neq \varphi^j_n$ for all $i, j \in \{1, \ldots, a\}$ with $i \neq j$.

Proof. (Sketch) A self-referential argument shows that the construction used
in the proof of Theorem 4 would make a mistake if conditions (1) and (2) of
the corollary were not fulfilled. □

This corollary somewhat strengthens the remarks following Corollary 1. Here
we have the situation that, for infinitely many n,
$(\bigcup_{j=1}^{a} ex^j_n) \subseteq \varphi^i_n$ and $\varphi^i_n \neq \varphi^k_n$ for
all $i, k \in \{1, \ldots, a\}$ with $i \neq k$. Condition (2) now implies that
every machine learns a different function when presented with the examples
$\bigcup_{j=1}^{a} ex^j_n$. The other comments apply here as well.

Furthermore note that there exist families
$\{ex^i_j\}_{j \ge 0,\, i \in \{1, \ldots, a\}}$ of finite sets, which we also might
call “good examples” for now, such that $ex^i_j \subseteq \varphi^{i'}_{j'}$ iff
$\varphi^i_j = \varphi^{i'}_{j'}$, for all $i, i' \in \{1, \ldots, a\}$ and
$j, j' \ge 0$.
To see this, one checks the proof of Theorem 4, which yields that for every
function f in the generated numberings only finitely many x exist such that
$f(x) \neq 0$. So there exist families containing examples with all and only
those x.

It is easy to see that these sets of good examples are equal if and only if
the corresponding functions are. This obviously yields the assertion stated
above. But none of these families is computable, since this would clearly
contradict the corollary.

On the other hand, it would be easy to compute these families “in the limit”.
And hence, a teacher with lots of experience might know a good part of these
typical examples for his area of expertise.
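
One way to read “in the limit” here is sketched below in Python, as an illustration rather than the author's definition: at stage s the teacher lists all non-zero values of f on the arguments 0, ..., s. For a function with only finitely many non-zero values these approximations eventually stop changing, although no stage announces that the final set has been reached. The target function `f` and the helper `examples_at_stage` are hypothetical.

```python
from typing import Callable, Dict


def examples_at_stage(f: Callable[[int], int], s: int) -> Dict[int, int]:
    """Stage-s approximation: all non-zero values of f on arguments 0..s."""
    return {x: f(x) for x in range(s + 1) if f(x) != 0}


def f(x: int) -> int:
    """Hypothetical target with the values 1 1 0 0 1 0 0 0 ..."""
    return 1 if x in (0, 1, 4) else 0


stages = [examples_at_stage(f, s) for s in range(7)]
for s, ex in enumerate(stages):
    print(s, ex)
```

From stage 4 onwards the approximation no longer changes and coincides with the set of all arguments on which f is non-zero.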

One might ask if this phenomenon can also be achieved for infinitely many
numberings. As the following theorem, cf. [?], shows, there is indeed an
infinite sequence of pairwise non-equivalent numberings, all of them
enumerating the same class of recursive functions. The proof used here is
conceptually much simpler than the original one in [?].

Theorem 5. There exists a sequence $\varphi^i$, $i \ge 0$, of numberings such that
for all i, j the following properties are satisfied:

(1) $\varphi^i \in \mathcal{R}^2$, $\varphi^j \in \mathcal{R}^2$ and both numberings are one-one,

(2) $\mathcal{P}_{\varphi^i} = \mathcal{P}_{\varphi^j}$,

(3) $i \neq j$ implies $\varphi^i \not\le \varphi^j$.

Proof. (Sketch) This is again a diagonalization, but this time it is made sure
that, for i < j, $\varphi^i$ can not be reduced to $\varphi^j$. In essence, the
argument used to prove Theorem 4 is repeated an infinite number of times. The
only thing to take care of is the accounting of which functions were used in
the construction, in order to fulfill condition (2). □

But there is no analogue for Corollaries 1 and 2. An analogue to the
corollaries in the finite case would require, for any choice of the good
examples, a sequence of indices $n_k$, $k \ge 0$, such that
$\varphi^i_{n_i} \neq \varphi^j_{n_j}$ and
$\bigcup_{k \ge 0} ex^k_{n_k} \subseteq \varphi^i_{n_i}$ for all i, j. Note that in the
corollaries above one index n fulfills this requirement, so that a whole
sequence is not needed. But requiring one index n would complicate the
previous proof and is not needed for the following argument.

For each $\varphi^i$ in the sequence constructed in Theorem 5,
$\mathcal{P}_{\varphi^i} \in \text{Gex-Fin}_{\varphi^i}$ holds by Proposition 1. Furthermore,
this can be achieved with good examples $gex^i$ that satisfy, for all i, j, the
following conditions: $\mathrm{card}(gex^i_j) > \max(\{i, j\})$ and furthermore
$gex^i_j$ is an initial segment of $\varphi^i_j$. To see this, assume
$\mathcal{P}_{\varphi^i} \in \text{Gex-Fin}_{\varphi^i}$ with good examples $ex^i$ for some i.
Define $gex^i_j$, for all j, by
$gex^i_j = \{(x, \varphi^i_j(x)) \mid 0 \le x \le \max(\{i, j, \max(\mathrm{domain}(ex^i_j))\})\}$.
Since the inference machine witnessing $\mathcal{P}_{\varphi^i} \in \text{Gex-Fin}_{\varphi^i}$
has to learn from all finite supersets of the good examples $ex^i$ as well,
obviously $gex^i$ also are good examples for $\mathcal{P}_{\varphi^i}$ with respect to
$\varphi^i$.
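
The padding step that turns a set of good examples into an initial segment can be written down directly. The Python sketch below is a toy illustration: the function f is passed as a callable, the example set is a dictionary from arguments to values, and the name `pad_examples` is hypothetical.

```python
from typing import Callable, Dict


def pad_examples(ex: Dict[int, int], f: Callable[[int], int], i: int, j: int) -> Dict[int, int]:
    """gex = {(x, f(x)) | 0 <= x <= max(i, j, max(domain(ex)))}."""
    bound = max([i, j] + list(ex))          # also works for an empty example set
    return {x: f(x) for x in range(bound + 1)}


def f(x: int) -> int:
    """Hypothetical target: 0 everywhere except f(3) = 1."""
    return 1 if x == 3 else 0


ex = {3: 1}
gex = pad_examples(ex, f, i=1, j=5)
print(gex)  # {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0}
```

The padded set contains the original examples, is an initial segment of f, and has more than max(i, j) elements, which are the properties used in the argument that follows.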

This yields that for every sequence $n_k$, the set
$E = \bigcup_{k \ge 0} gex^k_{n_k}$ must be infinite. For every x there is at least
one y such that $(x, y) \in E$, since the good examples are initial segments.
Therefore E can be contained in at most one function. In fact, E might not
even be a function, but a relation. An analogue to the corollaries in the
finite case therefore can not hold.
4 Conclusion, Open Problems, and Interpretation

Theorem 4 and Corollary 2 witness the following fact: for every $n \ge 2$, there
is an enumerable class of recursive functions and n hypothesis spaces, such
that, no matter how the teacher chooses the examples for some concepts within
the class, each machine witnessing the learnability for one of those
hypothesis spaces will learn a different function when confronted with this
set of examples. Even worse, they all produce seemingly identical hypotheses
consistent with the given examples, so that it is impossible for the teacher
to know whether the machines have successfully learned the intended function
or not.

Of course, there are unanswered questions. 

What do concept classes look like that fulfill the corollaries? What are
necessary and/or sufficient conditions for those? Or, rephrased in recursion
theoretic terms: given two recursive numberings $\varphi$ and $\psi$, what is
needed to be able to test “pointwise equality” between them? I.e., is there an
$f \in \mathcal{R}$ such that f(i) = 1 iff $\varphi_i = \psi_i$?

Another interesting question is which classes of recursive functions have the
property that all their numberings with decidable equality are reducible to
one another, or, equivalently, that all their one-one numberings are
equivalent with respect to $\equiv$. This would characterize the classes of
functions that are learnable with typical good examples, by using Theorem 2
in a different way. For enumerable classes, necessary and sufficient
conditions are known in order to assure that all of their numberings are
equivalent with respect to $\equiv$; cf. [?] and the references therein. But it
is easy to see that there is an enumerable class U of functions such that all
one-one numberings of U are equivalent, but not all enumerations of U are. For
example, let $U = \{f_i \mid i \in \mathbb{N}\} \cup \{\text{constant zero function}\}$,
where $f_i$ is always 0, with the exception of i, where it takes value 1.
Obviously, all one-one numberings of U are equivalent, and it is easy to
construct two numberings of U that are not equivalent, by exploiting the fact
that the constant zero function is an “accumulation point” for U. (The
function f is an accumulation point for U if for all $n \in \mathbb{N}$ there exists
$m_n$ such that $m_n < m_{n+1}$ and $f(x) = g_n(x)$ for all $x < m_n$ and
$f(m_n) \neq g_n(m_n)$ for some $g_n$ in U.)
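
That the constant zero function is an accumulation point of this class can be checked on finitely many instances with a short Python sketch. It assumes the choice $m_n = n$ and $g_n = f_n$, which is one natural set of witnesses; the helper names `f`, `zero` and `is_witness` are hypothetical.

```python
from typing import Callable

Func = Callable[[int], int]


def zero(x: int) -> int:
    return 0


def f(i: int) -> Func:
    """f_i is 0 everywhere except at argument i, where it takes value 1."""
    return lambda x: 1 if x == i else 0


def is_witness(limit: Func, g: Func, m: int) -> bool:
    """g agrees with `limit` on all x < m and differs from it at m."""
    return all(limit(x) == g(x) for x in range(m)) and limit(m) != g(m)


# With m_n = n and g_n = f_n the sequence m_0 < m_1 < ... is strictly increasing.
witnesses = [is_witness(zero, f(n), n) for n in range(20)]
print(all(witnesses))  # True
```

Each $f_n$ agrees with the zero function below n and leaves it exactly at n, so arbitrarily long initial segments of the zero function are shared with other members of the class.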

In the comments following Corollary 2 it is mentioned that “typical” examples
could be computed in the limit. It could be interesting to find an intuitive
definition for “limit computable examples” without “shifting” most of the
inference process into this limiting computation.

The reader only interested in mathematical results may now stop reading. 
The following tries to give some connections of the obtained results to real life 
teaching and interpret them in that context. 

As mentioned before, the pair $(M, \varphi)$ might be thought of as a pupil: the
pupil's learning algorithm M and its knowledge base $\varphi$. In a classroom the
following might happen: a teacher gives two pupils $(M_1, \varphi)$ and
$(M_2, \psi)$ the examples for some concepts he wants them to learn. The pupils
process those examples and return an identical hypothesis i, as happens even
in the theoretical case; see the comments following Corollary 1. Now, if the
teacher does not know which knowledge base a pupil is using, he might not know
if both pupils have learned the intended concept. As a consequence, he will
have to submit them both to a series of tests, in order to check what they
learned. And eventually, he will have to correct mistakes the pupils made.
Normally this is achieved by presenting new examples to the pupil in order to
make it see its mistake and learn the correct concept.

The corollaries show that there are concept classes and sets of pupils such
that for some concepts the teacher has to compute a different set of examples
for each pupil. But this is only feasible in small classrooms, since otherwise the
teacher will not have enough time to dedicate to each student and
compute a different set of examples for each one. So the model reflects the
well-known fact that smaller classrooms improve learning performance. In addition
one might notice that all students have the same learning potential, since all 
knowledge bases contain the same set of functions. Hence no pupil could be 
considered “stupid”, they just generate and test ideas in a different way. This 
might be a reason why some pupils like some teachers and learn better with them: 
teacher and pupil use a similar “knowledge base” , the teacher for generating and 
the pupil for testing the examples. So it is to be expected that they will achieve good
learning performance. Of course, there might, and most likely will, be other 
reasons, but it would be interesting to test this hypothesis in a real classroom. 

I am grateful to R. Wiehagen, C. Smith and the referees for many valuable
suggestions and to W. and A. Nessel for carefully reading a draft of this work.

Abstract. Blum and Blum (1975) showed that a class B of suitable recursive
approximations to the halting problem is reliably EX-learnable.
These investigations are carried on by showing that B is neither in NUM
nor robustly EX-learnable. Since the definition of the class B is quite
natural and does not contain any self-referential coding, B serves as an
example that the notion of robustness for learning is considerably more
restrictive than intended.

Moreover, variants of this problem obtained by approximating any given
recursively enumerable set A instead of the halting problem K are studied.
All corresponding function classes U(A) are still EX-inferable but
may fail to be reliably EX-learnable, for example if A is non-high and
hypersimple. Additionally, it is proved that U(A) is neither in NUM nor
robustly EX-learnable provided A is part of a recursively inseparable
pair, A is simple but not hypersimple, or A is neither recursive nor high.
These results provide more evidence that there is still some need to find
an adequate notion for “naturally learnable function classes.”



1. Introduction 

Though algorithmic learning of recursive functions has been intensively studied
within the last three decades, there is still some need to elaborate this theory
further. For the purpose of motivation, let us briefly recall the basic scenario.

An algorithmic learner is fed growing initial segments of the graph of the
target function $f$. Based on the information received, the learner computes a
hypothesis on each input. The sequence of all computed hypotheses has to converge
to a correct, finite and global description of the target $f$. We shall refer
to this scenario by saying that $f$ is EX-learnable (cf. Definition 1).

Clearly, what one is really interested in are powerful learning algorithms that
can learn not just one function but all functions from a given class of functions.
Gold [11] provided the first such powerful learner, i.e., the identification by
enumeration algorithm, and showed that it can learn every class contained in NUM.
Here NUM denotes the family of all function classes that are subsets of some
recursively enumerable class of recursive functions.
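The identification by enumeration idea can be illustrated with a minimal sketch. It is not taken from the paper: the finite list `toy_class` is a hypothetical stand-in for a recursively enumerable class of total functions, and the helper names are chosen for illustration only. On each initial segment the learner outputs the least index whose function agrees with all data seen so far.

```python
from typing import Callable, List, Sequence


def identify_by_enumeration(
    candidates: Sequence[Callable[[int], int]],
    data: Sequence[int],
) -> int:
    """Return the least index whose candidate agrees with every observed value.

    data[x] is the target's value f(x) for x = 0, ..., len(data) - 1.
    """
    for index, candidate in enumerate(candidates):
        if all(candidate(x) == value for x, value in enumerate(data)):
            return index
    raise ValueError("target is not in the enumerated class")


# Hypothetical toy enumeration; a real class in NUM would be infinite.
toy_class: List[Callable[[int], int]] = [
    lambda x: 0,
    lambda x: x % 2,
    lambda x: x,
    lambda x: x * x,
]


def hypotheses_on(target: Callable[[int], int], steps: int) -> List[int]:
    """Feed growing initial segments of the target and record each hypothesis."""
    return [
        identify_by_enumeration(toy_class, [target(x) for x in range(n + 1)])
        for n in range(steps)
    ]


guesses = hypotheses_on(lambda x: x * x, 6)
print(guesses)
```

In this toy run the hypotheses stabilise on the index of the squaring function once the data rule out every earlier candidate, which is the syntactic convergence that EX-learning asks for.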

There are, however, learnable classes of recursive functions which are not
contained in NUM. Perhaps the most prominent example is the class $\mathcal{SD}$ of
self-describing recursive functions, i.e., of all those functions that compute a
program for themselves on input 0. Clearly, $\mathcal{SD}$ is EX-learnable.

Since Gold's [11] pioneering paper a huge variety of learning criteria have
been proposed within the framework of inductive inference of recursive functions.
By comparing these inference criteria to one
another, it became popular to show separation results by using function classes
with self-referential properties. On the one hand, the proof techniques developed
are mathematically quite elegant. On the other hand, these separating examples
may be considered to be a bit artificial, because of the use of self-describing
properties. Hence, Bārzdiņš suggested looking at versions of learning that are
closed under computable transformations (cf. [20, 28]). For example, a class $U$ is
robustly EX-learnable iff, for every computable operator $\Theta$ such that $\Theta(U)$ is a
class of recursive functions, the class $\Theta(U)$ is EX-learnable, too (cf. Definition 4).
There have been many discussions about which operators are admissible in this context
(cf., e.g., [10, 14, 16, 20, 23, 28]). In the end, it turned out to be most suitable to
consider only general recursive operators, that is, operators which map every
total function to a total one. The resulting notion of robust EX-learning is the
most general one among all notions of robust EX-inference.

Next, we state the two main questions that are studied in the present paper.

(1) What is the overall theory developed so far telling us about the learnability
of “naturally defined function classes?”

(2) What is known about the robust EX-learnability of such “naturally defined
function classes?”

Clearly, an answer to the first question should tell us something about the
usefulness of the theory, and an answer to the second question should, in particular,
provide some insight into the “naturalness” of robust EX-learning. However,
our knowledge concerning both questions has been severely limited. For function
classes in NUM everything is clear, i.e., their learnability has been proved
with respect to many learning criteria including robust EX-learning. Next, let us
consider one of the few “natural” function classes outside NUM that have been
considered in the literature, i.e., the class $C$ of all recursive complexity functions.
Then, using Theorem 2.4 and Corollary 2.6 from the literature, one can conclude that $C$ is
not robustly EX-learnable for many complexity measures including space, since
there is no recursive function that bounds every function in $C$ for all but finitely
many arguments. On the other hand, $C$ itself is still learnable with respect to
many inference criteria by using the identification by enumeration learner.

The latter result already provides some evidence that the notion of robust
EX-learning may be too restrictive. Nevertheless, the situation may be completely
different if one looks at classes of {0,1}-valued recursive functions, since
their learnability sometimes differs considerably from the inferability of arbitrary
function classes. As far as the authors are aware, one of
the very few “natural classes” of {0,1}-valued recursive functions that may be a
candidate for not being included in NUM has been proposed by Blum and Blum.
They considered a class B of approximations to the halting problem K and
showed that B is reliably EX-learnable. This class B is quite natural and not
self-describing. It remained, however, open whether or not B is in NUM.

Within the present work, it is shown that B is neither in NUM nor robustly
EX-learnable. Moreover, we study generalizations of Blum and Blum's
original class by considering classes U(A) of approximations for any recursively
enumerable set A. In particular, it is shown that all these classes remain
EX-learnable but not necessarily reliably EX-inferable (cf. Theorems 14 and 16).
Furthermore, we show U(A) to be neither in NUM nor robustly EX-learnable
provided A is part of a recursively inseparable pair, A is simple but not
hypersimple, or A is neither recursive nor high (cf. Theorems 13 and 17).

Thus the results obtained enlarge our knowledge concerning the learnability
of “naturally defined” function classes. Additionally, all those classes U(A)
which are not in NUM, as well as B, are natural examples of a class which is
on the one hand not self-describing and on the other hand not robustly learnable.
So all these U(A) provide some evidence that the presently discussed notions
of robust and hyperrobust learning destroy not only coding
tricks but also the learnability of quite natural classes.

Due to the lack of space, many proofs are only sketched or omitted. We refer
the reader to the full version of this paper.



2. Preliminaries 



Unspecified notations follow Rogers [25]. $\mathbb{N} = \{0, 1, 2, \ldots\}$ and $\mathbb{N}^*$ denote the
set of all natural numbers and the set of all finite sequences of natural numbers,
respectively. $\{0,1\}^*$ stands for the set of all finite $\{0,1\}$-valued sequences and for
all $x \in \mathbb{N}$ we use $\{0,1\}^x$ for the set of all $\{0,1\}$-valued sequences of length $x$.

The classes of all partial recursive and recursive functions of one, and two
arguments over $\mathbb{N}$ are denoted by $\mathcal{P}$, $\mathcal{P}^2$, $\mathcal{R}$, and $\mathcal{R}^2$, respectively. $f \in \mathcal{P}$ is
said to be monotone provided for all $x, y \in \mathbb{N}$ with $x \le y$ we have, if both $f(x)$ and $f(y)$
are defined then $f(x) \le f(y)$. $\mathcal{R}_{0,1}$ and $\mathcal{R}_{mon}$ denote the set of all $\{0,1\}$-valued
recursive functions and of all monotone recursive functions, respectively.

Furthermore, we write $f^n$ instead of the string $(f(0), \ldots, f(n))$, for any $n \in \mathbb{N}$
and $f \in \mathcal{R}$. Sometimes it will be suitable to identify a recursive function with
the sequence of its values, e.g., let $\alpha = (a_0, \ldots, a_k) \in \mathbb{N}^*$, $j \in \mathbb{N}$, and $p \in \mathcal{R}_{0,1}$;
then we write $\alpha j p$ to denote the function $f$ for which $f(x) = a_x$ if $x \le k$,
$f(k+1) = j$, and $f(x) = p(x - k - 2)$ if $x \ge k + 2$. Furthermore, let $g \in \mathcal{P}$ and
$\alpha \in \mathbb{N}^*$; we write $\alpha \sqsubseteq g$ iff $\alpha$ is a prefix of the sequence of values associated
with $g$, i.e., for all $x \le k$, $g(x)$ is defined and $g(x) = a_x$.

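Any function $\psi \in \mathcal{P}^2$ is called a numbering. Moreover, let $\psi \in \mathcal{P}^2$; then
we write $\psi_i$ for the function $x \mapsto \psi(i, x)$ and set $\mathcal{P}_\psi = \{\psi_i \mid i \in \mathbb{N}\}$ as well as
$\mathcal{R}_\psi = \mathcal{P}_\psi \cap \mathcal{R}$. Consequently, if $f \in \mathcal{P}_\psi$, then there is a number $i$ such that
$f = \psi_i$. If $f \in \mathcal{P}$ and $i \in \mathbb{N}$ are such that $\psi_i = f$, then $i$ is called a $\psi$-program
for $f$. Let $\psi$ be any numbering, and $i, x \in \mathbb{N}$; if $\psi_i(x)$ is defined (abbr. $\psi_i(x)\downarrow$)
then we also say that $\psi_i(x)$ converges. Otherwise, $\psi_i(x)$ is said to diverge (abbr.
$\psi_i(x)\uparrow$).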

A numbering $\varphi \in \mathcal{P}^2$ is called a Gödel numbering or acceptable numbering
(cf. Rogers [25]) iff $\mathcal{P}_\varphi = \mathcal{P}$, and for any numbering $\psi \in \mathcal{P}^2$, there is a $c \in \mathcal{R}$ such that
$\psi_i = \varphi_{c(i)}$ for all $i \in \mathbb{N}$. In the following, let $\varphi$ be any fixed Gödel numbering. As
usual, we define the halting problem to be the set $K = \{i \mid i \in \mathbb{N},\ \varphi_i(i)\downarrow\}$. Any
function $\Phi \in \mathcal{P}^2$ satisfying $dom(\varphi_i) = dom(\Phi_i)$ for all $i \in \mathbb{N}$ and such that
$\{(i, x, y) \mid i, x, y \in \mathbb{N},\ \Phi_i(x) \le y\}$ is recursive is called a complexity measure (cf. Blum).

Furthermore, let $NUM = \{U \mid (\exists \psi \in \mathcal{R}^2)[U \subseteq \mathcal{P}_\psi]\}$ denote the family of all
subsets of all recursively enumerable classes of recursive functions.

Next, we define the concepts of learning mentioned in the introduction. 

Definition 1. Let $U \subseteq \mathcal{R}$ and let $M : \mathbb{N}^* \to \mathbb{N}$ be a recursive machine.

(a) (Gold [11]) $M$ is an EX-learner for $U$ iff, for each function $f \in U$, $M$
converges syntactically to $f$ in the sense that there is a $j \in \mathbb{N}$ with $\varphi_j = f$ and
$j = M(f^n)$ for all but finitely many $n \in \mathbb{N}$.

(b) (Angluin) $M$ is a conservative EX-learner for $U$ iff $M$ EX-learns $U$
and $M$ makes in addition only necessary hypothesis changes in the sense that,
whenever $M(\sigma\eta) \ne M(\sigma)$ then the program $M(\sigma)$ is inconsistent with the data
$\sigma\eta$ by either $\varphi_{M(\sigma)}(x)\uparrow$ or $\varphi_{M(\sigma)}(x)\downarrow \ne \sigma\eta(x)$ for some $x \in dom(\sigma\eta)$.

(c) (Bārzdiņš; Case and Smith) $M$ is a BC-learner for $U$ iff, for each
function $f \in U$, $M$ converges semantically to $f$ in the sense that $\varphi_{M(f^n)} = f$
for all but finitely many $n \in \mathbb{N}$.

A class $U$ is EX-learnable iff it has a recursive EX-learner, and EX denotes
the family of all EX-learnable function classes. Similarly, we define when a class
is conservatively EX-learnable or BC-learnable. We write BC for the family of
all BC-learnable function classes.

Note that $EX \subset BC$ (cf. Case and Smith). As far as we are aware, it has been open
whether or not conservative learning constitutes a restriction for EX-learning of
recursive functions. The negative answer is provided by the next proposition.

Proposition 2. EX = conservative-EX.

Nevertheless, whenever suitable, we shall design a conservative learner instead
of just an EX-learner, thus avoiding the additional general transformation given
by the proof of Proposition 2.

Next, we define reliable inference. Intuitively, a learner $M$ is reliable provided it
converges if and only if it learns. There are several variants of reliable learning,
so we will give a justification of our choice below.

Definition 3 (Blum and Blum, Minicozzi). Let $U \subseteq \mathcal{R}$; then $U$ is
said to be reliably EX-learnable if there is a machine $M \in \mathcal{R}$ such that

(1) $M$ EX-learns $U$ and

(2) for all $f \in \mathcal{R}$, if the sequence $(M(f^n))_{n \in \mathbb{N}}$ converges, say to $j$, then $\varphi_j = f$.

By $\mathcal{R}EX$ we denote the family of all reliably EX-learnable function classes.

Note that one can replace the condition “$f \in \mathcal{R}$” in (2) of Definition 3 by
“$f \in \mathcal{P}$” or “all total $f$.” This results in different models of reliable learning,
say $\mathcal{P}EX$ and $\mathcal{T}EX$, respectively. Then for every $U \subseteq \mathcal{R}_{0,1}$ such that $U \in \mathcal{P}EX$
or $U \in \mathcal{T}EX$ one has $U \in NUM$. On the other hand, there are
classes $U \subseteq \mathcal{R}_{0,1}$ such that $U \in \mathcal{R}EX \setminus NUM$ (cf. Grabowski [12]). As a matter of fact, our
Theorem 6 below together with Blum and Blum's result $B \in \mathcal{R}EX$ provides
a much easier proof of the same result than Grabowski [12].

Finally, we define robust EX-learning. This involves the notion of general
recursive operators. A general recursive operator is a computable mapping that
maps functions over $\mathbb{N}$ to functions over $\mathbb{N}$ such that every total function is
mapped to a total function. For a formal definition and more information about
general recursive operators the reader is referred to the literature (e.g., [22]).

Definition 4 (Jain, Smith and Wiehagen [16]). Let $U \subseteq \mathcal{R}$; then $U$ is said
to be robustly EX-learnable if $\Theta(U)$ is EX-learnable for every general recursive
operator $\Theta$. By robust-EX we denote the family of all robustly EX-learnable
function classes.
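To make the notion of a general recursive operator concrete, here is a small illustrative sketch. It is not from the paper: the left-shift operator and the names `shift_operator` and `f` are assumptions chosen for illustration. The shift maps every total function to a total function and erases the value at argument 0, which is exactly where a self-describing function stores a program for itself.

```python
from typing import Callable

Function = Callable[[int], int]


def shift_operator(f: Function) -> Function:
    """Theta(f)(x) = f(x + 1); total whenever f is total."""
    return lambda x: f(x + 1)


def f(x: int) -> int:
    # The value 7 at argument 0 plays the role of a (hypothetical) self-description.
    return 7 if x == 0 else x * x


g = shift_operator(f)
shifted_values = [g(x) for x in range(4)]
print(shifted_values)
```

After the shift, the image function no longer exposes the value that was stored at argument 0, so a learner for the transformed class cannot simply read off a program from the input.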

3. Approximating the Halting Problem 

Within this section, we deal with Blum and Blum's class B. First, we define
the class of approximations to the halting problem considered by Blum and Blum.

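Definition 5. Let $r \in \mathcal{R}$ be such that for all $i \in \mathbb{N}$

\[ \varphi_{r(i)}(x) = \begin{cases} 1, & \text{if } \Phi_i(x)\downarrow \text{ and } \Phi_x(x) \le \Phi_i(x), \\ 0, & \text{if } \Phi_i(x)\downarrow \text{ and } \neg[\Phi_x(x) \le \Phi_i(x)], \\ \uparrow, & \text{otherwise.} \end{cases} \]

Now, we set $B = \{\varphi_{r(i)} \mid i \in \mathbb{N} \text{ and } \Phi_i \in \mathcal{R}_{mon}\}$.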

Blum and Blum have shown $B \in \mathcal{R}EX$ but left it open whether or not
$B \in NUM$. It is not, as our next theorem shows.

Theorem 6. $B \notin NUM$.

Proof. First, recall that K is part of a recursively inseparable pair (cf. [22,
Exercise III.6.23(a)]). That is, there is an r.e. set $H$ such that $K \cap H = \emptyset$ and for
every recursive set $A \supseteq H$ we have $|A \cap K| = \infty$. Now, we fix any enumerations
$k_0, k_1, k_2, \ldots$ and $h_0, h_1, h_2, \ldots$ of $K$ and $H$, respectively. Suppose to the
contrary that there exists a numbering $\psi \in \mathcal{R}^2$ such that $B \subseteq \mathcal{R}_\psi$. Next, we
define for each $\psi_e$ a function $g_e \in \mathcal{P}$ as follows. For all $e, x \in \mathbb{N}$ let

$g_e(x)$ = “Search for the least $n$ such that for $n = s + y$ either (A), (B) or (C)
happens:

(A) $y = h_s \wedge \psi_e(y) = 1$

(B) $y = k_s \wedge \psi_e(y) = 0 \wedge y > x$

(C) $\psi_e(y) > 1$

If (A) happens first, then set $g_e(x) = s + y$.

If (B) happens first, then let $g_e(x) = \Phi_y(y) + y$.

If (C) happens first, then let $g_e(x) = 0$.”

Claim 1. $g_e \in \mathcal{R}$ for all $e \in \mathbb{N}$.

If there is at least one $y$ such that $\psi_e(y) > 1$, then $g_e \in \mathcal{R}$. Now let
$\psi_e \in \mathcal{R}_{0,1}$ and suppose that there is an $x \in \mathbb{N}$ with $g_e(x)\uparrow$. Then there are no
$s, y$ such that $y = h_s$ and $\psi_e(y) = 1$. Hence, $M = \{y \mid y \in \mathbb{N} \wedge \psi_e(y) = 0\} \supseteq H$
and $M$ is recursive. Thus, $|M \cap K| = \infty$. So there must be a $y > x$ such that
$\psi_e(y) = 0$ and an $s \in \mathbb{N}$ with $y = k_s$. Thus (B) must happen, and since $y = k_s$,
we conclude $\Phi_y(y)\downarrow$. Hence, $g_e(x)\downarrow$, too, a contradiction. This proves Claim 1.

Claim 2. Let $e$ be any number such that $\psi_e = \varphi_{r(i)}$ for some $\varphi_{r(i)} \in B$. Then
$g_e(x) > \Phi_i(x)$ for all $x \in \mathbb{N}$.

Assume any $i, e$ as above, and consider the definition of $g_e(x)$. Suppose
$g_e(x) = s + y$ for some $s, y$ such that $y = h_s$ and $\psi_e(y) = 1$. Since
$\psi_e(y) = \varphi_{r(i)}(y) = 1$ implies $\Phi_y(y) \le \Phi_i(y)$, and hence $y \in K$, we get a
contradiction to $K \cap H = \emptyset$. Thus, this case cannot happen.

Consequently, in the definition of $g_e(x)$ condition (B) must have happened.
Thus, some $s, y$ such that $y > x$, $y = k_s$ and $\psi_e(y) = 0$ have been found.
Since $y = k_s$, we conclude $\Phi_y(y)\downarrow$ and thus $g_e(x) \ge \Phi_y(y)$. Because of
$\psi_e(y) = \varphi_{r(i)}(y) = 0$, we obtain $\Phi_i(y) < \Phi_y(y)$ by the definition of $\varphi_{r(i)}$.
Now, putting it all together, we get $g_e(x) \ge \Phi_y(y) > \Phi_i(y) \ge \Phi_i(x)$, since
$y > x$ and $\Phi_i \in \mathcal{R}_{mon}$.

This proves Claim 2.

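Claim 3. For every $b \in \mathcal{R}$ there exists an $i \in \mathbb{N}$ such that $\Phi_i \in \mathcal{R}_{mon}$ and
$b(x) < \Phi_i(x)$ for all $x \in \mathbb{N}$.

Let $r \in \mathcal{R}$ be such that for all $j, x \in \mathbb{N}$ we have

\[ \varphi_{r(j)}(0) = \begin{cases} 0, & \text{if } \neg[\Phi_j(0) \le b(0)], \\ \varphi_j(0) + 1, & \text{otherwise,} \end{cases} \]

and for $x > 0$

\[ \varphi_{r(j)}(x) = \begin{cases} 0, & \text{if } \Phi_j(n) \text{ is defined for all } n < x \ \wedge\ \neg[\Phi_j(x) < \Phi_j(x-1) \vee \Phi_j(x) \le b(x)], \\ \varphi_j(x) + 1, & \text{if } \Phi_j(n) \text{ is defined for all } n < x \ \wedge\ [\Phi_j(x) < \Phi_j(x-1) \vee \Phi_j(x) \le b(x)], \\ \uparrow, & \text{otherwise.} \end{cases} \]

By the fixed point theorem there is an $i \in \mathbb{N}$ such that $\varphi_{r(i)} = \varphi_i$. Now, one
inductively shows that $\varphi_i = 0^\infty$, $\Phi_i \in \mathcal{R}_{mon}$ and $b(x) < \Phi_i(x)$ for all $x \in \mathbb{N}$,
and Claim 3 follows.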

Finally, by Claim 1, all $g_e \in \mathcal{R}$ and thus there is a function $b \in \mathcal{R}$ such that
$b(x) > g_e(x)$ for all $e \in \mathbb{N}$ and all but finitely many $x \in \mathbb{N}$. Together
with Claim 2, this function $b$ contradicts Claim 3, and hence $B \notin NUM$. □
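The final step can be spelled out as a short worked derivation. This is a sketch under the assumption, implicit in the construction of the functions $g_e$, that the map $(e, x) \mapsto g_e(x)$ is computable and, by Claim 1, total; the particular choice of $b$ below is one standard way to obtain such a bound and is not claimed to be the authors' choice.

```latex
\[
  b(x) \;=\; 1 + \max_{e \le x} g_e(x)
  \quad\Longrightarrow\quad
  b(x) > g_e(x) \ \text{ for all } e \in \mathbb{N} \text{ and all } x \ge e .
\]
\[
  \text{Claim 3 yields } i \text{ with } \Phi_i \in \mathcal{R}_{mon} \text{ and } b(x) < \Phi_i(x) \text{ for all } x .
\]
\[
  \text{If } \psi_e = \varphi_{r(i)}, \text{ then by Claim 2: }\quad
  \Phi_i(x) \;<\; g_e(x) \;<\; b(x) \;<\; \Phi_i(x) \quad \text{for all } x \ge e .
\]
```

Since $\Phi_i \in \mathcal{R}_{mon}$ puts $\varphi_{r(i)}$ into B, the assumed numbering $\psi$ would have to contain it at some index $e$, and the chain above is impossible.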

The next result can be obtained by looking at U(K) in Theorems 15 and 17.

Theorem 7. B is $\mathcal{R}EX$-inferable but not robustly EX-learnable.

Theorems 6 and 7 immediately allow the following separation, thus reproving
Grabowski's [12] Theorem 5.

Corollary 8. $NUM \cap \wp(\mathcal{R}_{0,1}) \subset \mathcal{R}EX \cap \wp(\mathcal{R}_{0,1})$.

Finally, we ask whether or not the condition $\Phi_i \in \mathcal{R}_{mon}$ in the definition of the
class B is necessary. The affirmative answer is given by our next theorem. That
is, instead of B, we now consider the class $\tilde{B} = \{\varphi_{r(i)} \mid i \in \mathbb{N} \text{ and } \Phi_i \in \mathcal{R}\}$.

Theorem 9. $\tilde{B}$ is not BC-learnable.

Next, we generalize the approach undertaken so far by considering classes U(A)
of approximations to any recursively enumerable (abbr. r.e.) set A.

4. Approximating Arbitrary r.e. Sets 

The definition of Blum and Blum's class implicitly uses the measure $\Phi_K$
defined as $\Phi_K(x) = \Phi_x(x)$ for measuring the speed by which K is enumerated.
Using this notion $\Phi_K$, the class B of approximations of K is defined as

$B = \{f \in \mathcal{R}_{0,1} \mid (\exists \Phi_e \in \mathcal{R}_{mon})(\forall x)[f(x) = 1 \iff \Phi_K(x) \le \Phi_e(x)]\}$.

Our main idea is to replace K by an arbitrary r.e. set A and to replace $\Phi_K$ by
a measure $\Phi_A$ of (the enumeration speed of) A. Such a measure satisfies the
following two conditions:

- The set $\{(x, y) \mid \Phi_A(x)\downarrow \le y\}$ is recursive.

- $x \in A \iff (\exists y)[\Phi_A(x)\downarrow \le y]$.

Here, $\Phi_A$ is intended to be taken as the function $\Phi_i$ of some index $i$ of A, but
sometimes we might also take the freedom to look at some other functions $\Phi_A$
satisfying the two requirements above. The natural definition for a class U(A)
corresponding to the class B in the case A = K based on an underlying function
$\Phi_A$ is the following.

Definition 10. Given an r.e. set A, an enumeration $\Phi_A$ and a total function
$\Phi_e$, let

\[ f_e(x) = \begin{cases} 1, & \text{if } \Phi_A(x)\downarrow \le \Phi_e(x), \\ 0, & \text{if } \neg[\Phi_A(x)\downarrow \le \Phi_e(x)]. \end{cases} \]

Now U(A) consists of all those $f_e$ where $\Phi_e \in \mathcal{R}_{mon}$.
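A toy computation may help to see how a member of U(A) approximates A from below. This is only an illustrative sketch: the finite table `enum_step` is a hypothetical stand-in for the measure of an r.e. set (for a genuine r.e. set one cannot decide that an element is never enumerated), and the names `approximation`, `small_bound` and `large_bound` are assumptions.

```python
from typing import Callable, Dict


def approximation(
    enum_step: Dict[int, int],
    phi_e: Callable[[int], int],
) -> Callable[[int], int]:
    """Return f_e with f_e(x) = 1 iff x enters A within phi_e(x) steps."""

    def f_e(x: int) -> int:
        step = enum_step.get(x)  # None: x is never enumerated, so x is not in A
        return 1 if step is not None and step <= phi_e(x) else 0

    return f_e


# Hypothetical enumeration times: A = {1, 3, 4}.
enum_step = {1: 5, 3: 2, 4: 40}

small_bound = approximation(enum_step, lambda x: 3 * x + 1)  # monotone bound
large_bound = approximation(enum_step, lambda x: 50)         # monotone bound

small_values = [small_bound(x) for x in range(5)]
large_values = [large_bound(x) for x in range(5)]
print(small_values, large_values)
```

Both approximations are 0 outside A; a larger monotone bound recognises more elements of A, and only a bound that exceeds the enumeration time on all of A reproduces the characteristic function.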

Next, comparing U(K) to the original class B of Blum and Blum, one can
easily prove the following. For every $f \in B$ there is a function $g \in U(K)$ such
that for all $x \in \mathbb{N}$ we have $f(x) = 1$ implies $g(x) = 1$. Hence, the approximation
$g$ is at least as good as $f$. The converse is also true, i.e., for each $g \in U(K)$ there
is an $f \in B$ such that $g(x) = 1$ implies $f(x) = 1$ for all $x \in \mathbb{N}$. Therefore, we
consider our new classes of approximations as natural generalizations of Blum
and Blum's original definition.

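Moreover, note that there is a function $gen_A$ which computes for every index $e$ of
a monotone $\Phi_e$ a program $gen_A(e)$ for the function $f_e$ associated with $\Phi_e$:

\[ \varphi_{gen_A(e)}(x) = \begin{cases} 1, & \text{if } \Phi_A(x) \le \Phi_e(x)\downarrow \ \wedge\ (\forall y < x)[\Phi_e(y) \le \Phi_e(y+1)], \\ 0, & \text{if } \Phi_e(x)\downarrow \ \wedge\ \neg[\Phi_A(x) \le \Phi_e(x)] \ \wedge\ (\forall y < x)[\Phi_e(y) \le \Phi_e(y+1)], \\ \uparrow, & \text{otherwise.} \end{cases} \]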

Now, if A is recursive, everything is clear, since we have the following.

Theorem 11. If A is recursive then $U(A) \in NUM$.

The direct generalization of Theorem 6 would be that U(A) is not in NUM for
every non-recursive r.e. set A and every measure $\Phi_A$. Unfortunately, there are
some special cases where this is still unknown to us.

We obtained many intermediate results which give evidence that U(A) is
not in NUM for any non-recursive r.e. set A. First, every non-recursive set A has
a sufficiently “slow” enumeration such that $U(A) \notin NUM$ for this underlying
enumeration and the corresponding $\Phi_A$. Second, for many classes of sets we can
directly show that $U(A) \notin NUM$, whatever measure $\Phi_A$ we choose. Besides the
cases where A is part of a recursively inseparable pair or A is simple but not
hypersimple, the case of the non-recursive and non-high sets A is interesting, in
particular, since the proof differs from that for the two previous cases.

Recall that a set A is simple iff A is both r.e. and infinite, $\bar{A}$ is infinite, but
there is no infinite recursive set $R$ disjoint from A. A set A is hypersimple iff A
is r.e., $\bar{A}$ is infinite, and there is no function $f \in \mathcal{R}$ such that $f(n) > a_n$
for all $n \in \mathbb{N}$, where $a_0, a_1, \ldots$ is the enumeration of $\bar{A}$ in strictly increasing
order (cf. Rogers [25]). Using this definition of hypersimple sets, one can easily
show the following lemma.

Lemma 12. A set $A \subseteq \mathbb{N}$ is hypersimple iff

(a) A is r.e. and both A and $\bar{A}$ are infinite

(b) for all functions $g \in \mathcal{R}$ with $g(x) > x$ for all $x \in \mathbb{N}$ there exist infinitely
many $x \in \mathbb{N}$ such that $\{x, x+1, \ldots, g(x)\} \subseteq A$.

Now, we are ready to state the announced theorem.

Theorem 13. U(A) is not in NUM for the following r.e. sets A.

(a) A is part of a recursively inseparable pair.

(b) A is simple but not hypersimple.

(c) A is neither recursive nor high.

Proof. We sketch only the proof of Assertion (c) here. Assume by way of
contradiction $U(A) \in NUM$. Thus, there is a $\psi \in \mathcal{R}^2$ such that $U(A) \subseteq \mathcal{R}_\psi$.
Assume without loss of generality that $0 \in A$. The function
$d_A(x) = \max\{\Phi_A(y) \mid y \le x \text{ and } y \in A\}$ is total and recursive relative to A. If now
$m(x) > d_A(x)$, then the function generated by $m$ in accordance with Definition 10
is equal to the characteristic function of A:

\[ A(x) = \begin{cases} 1, & \text{if } \Phi_A(x) \le m(x), \\ 0, & \text{otherwise.} \end{cases} \]

So one can define the following A-recursive function $h$:

$h(x) = \min\{y > x \mid (\forall j < x)(\exists z)[(x \le z \le y) \wedge \psi_j(z) \ne A(z)]\}$.

Since A is not recursive, no function $\psi_j$ can be a finite variant of $A(x)$, and
thus $h$ is total. Using $h$ we next define the following total A-recursive function $g$ by
$g(x) = \sum_{y=0}^{h(x)} d_A(y)$. Since A is not high, there is a function $b \in \mathcal{R}$ such that
$b(x) > g(x)$ for infinitely many $x$. By Claim 3 in the demonstration of Theorem 6
there exists an $e \in \mathbb{N}$ such that $\Phi_e \in \mathcal{R}_{mon}$ and $\Phi_e(x) > b(x)$ for all $x \in \mathbb{N}$.
Thus, $\Phi_e(x) > g(x)$ for infinitely many $x$.

Next, for every $\psi_k \in \mathcal{R}_\psi$ there exists an $x > k$ such that $\Phi_e(x) > g(x)$.
Consider all $y = x, x+1, \ldots, h(x)$. By the definition of $g$ and by $\Phi_e \in \mathcal{R}_{mon}$,
we have $\Phi_e(y) > d_A(y)$ for all these $y$. Thus, by the choice of $d_A$ and the definition
of $\varphi_{gen_A(e)}$ we arrive at $\varphi_{gen_A(e)}(y) = A(y)$ for all $y = x, x+1, \ldots, h(x)$.
But now the definition of the function $h$ guarantees that $\psi_k(z) \ne \varphi_{gen_A(e)}(z)$
for some $z$ with $x \le z \le h(x)$. Consequently, $\varphi_{gen_A(e)}$ differs from all $\psi_k$, in
contradiction to the assumption $U(A) \in NUM$. □



5. Reliable and EX-Learnability of U(A)

Blum and Blum showed $B \in \mathcal{R}EX$. The EX-learnability of U(A) alone can
be generalized to every r.e. set A, but this is not possible for reliability. But
before dealing with $\mathcal{R}EX$-inference, we show that every U(A) is EX-learnable.

Theorem 14. U(A) is EX-learnable for all r.e. sets A.

Proof. If A is recursive, then $U(A) \in NUM$ (cf. Theorem 11) and thus EX-learnable.
So let A be non-recursive and let $\Phi_A$ be a recursive enumeration of A.
An EX-learner for the class U(A) is given as follows.

- On input $\sigma$, disqualify all $e$ such that there are $x \in dom(\sigma)$ and $y \le |\sigma|$
  satisfying one of the following three conditions:

  (a) $\Phi_{gen_A(e)}(x) \le |\sigma|$ and $\varphi_{gen_A(e)}(x) \ne \sigma(x)$

  (b) $\sigma(x) = 0$, $\Phi_A(x) \le y$ and $\neg[\Phi_e(x) \le y]$

  (c) $\Phi_e(x+1) \le y$ and $\neg[\Phi_e(x) \le y]$.

- Output $gen_A(e)$ for the smallest $e$ not yet disqualified.

The algorithm disqualifies only such indices $e$ where $\varphi_{gen_A(e)}$ is either defined
and false or undefined for some $x \in dom(\sigma)$. Thus the learner is conservative.

Since the correct indices are never disqualified, it remains to show that the
incorrect ones are. This clearly happens if $\varphi_{gen_A(e)}(y)\downarrow \ne f(y)$ for some $y$.
Otherwise let $z$ be the first undefined place of $\varphi_{gen_A(e)}$. This undefined place is
either due to the fact that $\Phi_e(x) > \Phi_e(x+1)$ for some $x < z$ or that $\Phi_e(z)\uparrow$.
In the first case, $e$ is eventually disqualified by condition (c). In the second case,
either $\Phi_e(x+1)\downarrow$ for some first $x \ge z$, and then $e$ is again eventually disqualified
by condition (c), or $\Phi_A(x)\downarrow$ for some $x \in A$ above $z$, and so $e$ is disqualified by
condition (b). Hence, the learning algorithm is correct. □
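The "disqualify, then output the least surviving index" pattern of this proof can be illustrated by a strongly simplified sketch. It models only a step-bounded version of condition (a); conditions (b) and (c), the measure of A, and the function $gen_A$ are left out. The representation of a program as a function returning a (value, steps) pair, and all names below, are hypothetical and chosen for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical model of a program: x -> (value, steps needed), or None if it diverges on x.
Program = Callable[[int], Optional[Tuple[int, int]]]


def disqualified(program: Program, sigma: Sequence[int]) -> bool:
    """Condition (a) only: within |sigma| steps the program visibly contradicts sigma."""
    budget = len(sigma)
    for x, observed in enumerate(sigma):
        result = program(x)
        if result is not None:
            value, steps = result
            if steps <= budget and value != observed:
                return True
    return False


def learner(programs: Sequence[Program], sigma: Sequence[int]) -> int:
    """Output the least index that is not yet disqualified on sigma."""
    for e, program in enumerate(programs):
        if not disqualified(program, sigma):
            return e
    raise ValueError("all listed programs are disqualified")


def constant_zero(x: int) -> Tuple[int, int]:
    return (0, 1)


def spike_at_two(x: int) -> Tuple[int, int]:
    return (1 if x == 2 else 0, x + 3)


programs: List[Program] = [constant_zero, spike_at_two]
target = [0, 0, 1, 0, 0]
guesses = [learner(programs, target[: n + 1]) for n in range(len(target))]
print(guesses)
```

In this sketch a hypothesis is abandoned only after it has been contradicted by the data within the step budget, and a disqualified index stays disqualified on every longer input, which mirrors why the learner in the proof is conservative.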



The result that B is reliably EX-learnable can be generalized to halves of
recursively inseparable pairs and to simple but not hypersimple sets.

Theorem 15. U(A) is reliably EX-learnable if

(a) A is part of a recursively inseparable pair or

(b) A is simple but not hypersimple.

Proof. The central idea of the proof is that conditions (a) and (b) allow us to
identify a class of functions which contains all recursive functions which are too
difficult to learn and on which the learner then signals divergence infinitely
often. The recursive functions outside this class turn out to be EX-learnable
and contain the class U(A).

The learner $M$ does not need to succeed on functions $f \notin \mathcal{R}_{0,1}$ or if $f(x) = 1$
for almost all $x \in \bar{A}$. Now, the second condition can be checked indirectly for
$f \in \mathcal{R}_{0,1}$ and the A in the precondition of the theorem.

In case (a), let A and $H = \{b_0, b_1, \ldots\}$ form a recursively inseparable pair.
If $f(x) = 1$ for almost all $x \in \bar{A}$ then $f(b_s) = 1$ for some $b_s$. So one defines
that $\sigma$ is disqualified if $\sigma(x) \ge 2$ for some $x$ or if $\sigma(b_s)\downarrow = 1$ for some $s \le |\sigma|$.

In case (b), the set A is simple but not hypersimple. By Lemma 12 there
is a function $g \in \mathcal{R}$ with $g(x) > x$ for all $x \in \mathbb{N}$ such that $\bar{A}$ intersects every
interval $\{x, x+1, \ldots, g(x)\}$. But if $f(x) = 1$ for almost all $x \in \bar{A}$, then, by the
simplicity of A, $f(x) = 1$ for almost all $x$ and there is an $x$ with $f(y) = 1$ for
all $y \in \{x, x+1, \ldots, g(x)\}$. So one defines that $\sigma$ is disqualified if $\sigma(x) \ge 2$ for
some $x$ or if there is an $x$ with $\sigma(y) = 1$ for all $y \in \{x, x+1, \ldots, g(x)\}$.

The reliable EX-learner $N$ is a modification of the learner $M$ from Theorem 14
which copies $M$ on all $\sigma$ except those which are disqualified; on them,
$N$ always outputs a guess for $\sigma 0^\infty$ and thus either converges to some $\sigma 0^\infty$ or
diverges by infinitely many changes of the hypothesis. Let $e(\sigma)$ be a program
for $\sigma 0^\infty$ and let

\[ N(\sigma) = \begin{cases} e(\sigma), & \text{if } \sigma \text{ is disqualified,} \\ M(\sigma), & \text{otherwise.} \end{cases} \]

For the verification, note that for every $f \in U(A)$ we have $f(x) = 0$ for all
$x \in \bar{A}$. Thus, if $f \in U(A)$ then no $\sigma \sqsubseteq f$ is disqualified and therefore $N$ is an
EX-learner for U(A).

Assume now that $N$ converges to an $e'$ on some recursive function $f$. If
this happens for a function $f$ such that some $\sigma' \sqsubseteq f$ has been disqualified then
$f = \sigma 0^\infty$ and so also $\varphi_{e'} = \sigma 0^\infty$ for some $\sigma \sqsubseteq f$. Thus, $N$ converges to a
correct program for $f$ in this case.

Otherwise, no $\sigma' \sqsubseteq f$ is disqualified. Since $N$ copies the indices of $M$ and
those are all of the form $gen_A(e)$, there is a least $e$ with $e' = gen_A(e)$. If
$f(x) = 0$ for infinitely many $x \in \bar{A}$, then $M$ converges to $gen_A(e)$ only if
$\varphi_{gen_A(e)} = f$; the algorithm is correct in that case.

Finally, consider the subcase that $f(x) = 0$ for only finitely many $x \in \bar{A}$.
Consequently, in case (a) $f(x) = 1$ for some $x \in H$ and in case (b) there must
be an $x$ such that $f(y) = 1$ for all $y = x, x+1, \ldots, g(x)$. In both cases, some
$\sigma' \sqsubseteq f$ is disqualified, thus this case cannot occur. Hence, $N$ is reliable. □

Theorem 16. If A is hypersimple and not high then $U(A) \notin \mathcal{R}EX$.

Proof. Let A be a hypersimple non-high set, let $\Phi_A$ be a corresponding measure,
and assume to the contrary that $U(A) \in \mathcal{R}EX$. Then also the union

$U(A) \cup \{\sigma 1^\infty \mid \sigma \in \{0,1\}^*\}$

is EX-learnable, since every class in NUM is also in $\mathcal{R}EX$ and $\mathcal{R}EX$ is closed
under union. Given an EX-learner $M$ for the above union, one can
define the following function $h_1$ by taking

$h_1(x) = \min\{s \ge x \mid (\forall \sigma \in \{0,1\}^x)(\forall y \le x)[M(\sigma 1^s) \ne M(\sigma) \vee
(\Phi_{M(\sigma)}(y) \le s \wedge \varphi_{M(\sigma)}(y) = \sigma 1^\infty(y))]\}$.

The function $h_1$ is total since any guess $M(\sigma)$ either computes the function
$\sigma 1^\infty$ or is eventually replaced by a new guess on $\sigma 1^\infty$. Note that $h_1 \in \mathcal{R}_{mon}$
and $h_1(x) \ge x$ for all $x$.

Since A is not recursive, there is no total function dominating $\Phi_A$. Thus
one can define a recursive function $h_2(x)$ by taking

$h_2(x)$ = the smallest $s$ such that there is a $y$ with $x \le y \le s$,
$h_1(y + h_1(y)) \le s$ and $\Phi_A(y) + \Phi_A(y+1) + \ldots + \Phi_A(y + h_1(y)) \le s$.

Since A is hypersimple, we directly get from Lemma 12 that $h_2 \in \mathcal{R}$. Consider
for every $f \in U(A)$ the index $i$ to which $M$ converges and an index $j$ with
$f = \varphi_{gen_A(j)}$.

Assume now that $M$ has converged to $i$ at $z \le x$. Consider the $y, s$ from
the definition of $h_2$ and let $\sigma = f(0), \ldots, f(y)$. If $M(\sigma 1^{h_1(y)}) \ne M(\sigma)$ then
there is some $y' \in \{y, y+1, \ldots, y + h_1(y)\}$ with $f(y') = 0$. As a consequence,
$\Phi_j(y') < \Phi_A(y') \le h_2(x)$. Since $\Phi_j \in \mathcal{R}_{mon}$, we know $\Phi_j(y) \le s$. Otherwise,
$\Phi_i(x) \le h_1(y)$ and $\varphi_i(x)$ has converged. Since $y \le h_2(x)$, we conclude
$\Phi_i(x) \le h_1(h_2(x))$. So one can give the following definition for $f$ by case distinction,
where the first case is taken which is applicable and where $\sigma = f(0), \ldots, f(z)$:

\[ \psi_{e(i,j,\sigma)}(x) = \begin{cases} \sigma(x), & \text{if } x \in dom(\sigma), \\ \varphi_i(x), & \text{if } \Phi_i(x) \le h_1(h_2(x)), \\ 1, & \text{if } \Phi_A(x) \le \Phi_j(x) \le h_2(x), \\ 0, & \text{otherwise.} \end{cases} \]

Since the search conditions in the second and third case are bounded by a recursive
function in $x$, the family of all $\psi_{e(i,j,\sigma)}$ contains only total functions
and its universal function $i, j, \sigma, x \mapsto \psi_{e(i,j,\sigma)}(x)$ is computable in all parameters.
Furthermore, for the correct $i, j, \sigma$ as chosen above, $\psi_{e(i,j,\sigma)}$ equals the given $f$
since, for all $x > z$, either $\varphi_i(x)$ converges within $h_1(h_2(x))$ steps to $f(x)$ or
$\Phi_A(x) \le \Phi_j(x) \le h_2(x)$. It follows that this family covers U(A) and that U(A)
is in NUM, a contradiction to Theorem 13 since A is neither recursive
nor high. □

6. Robust Learning

A mathematically elegant proof method to separate learning criteria is the use of
classes of self-describing functions. On the one hand, these examples are a bit
artificial, since they use coding tricks. On the other hand, natural objects like
cells contain a description of themselves. Nevertheless, from a learning-theoretical
point of view some criticism remains in order, since a learner needs only to fetch some
code from the input.

Therefore, Bārzdiņš suggested looking at restricted versions of learning: For
example, a class $S$ is robustly EX-learnable iff, for every operator $\Theta$, the class
$\Theta(S)$ is EX-learnable. There were many discussions about which operators $\Theta$ are
admissible in this context and how to deal with those cases where $\Theta$ maps
some functions in $S$ to partial functions. In the end, it turned out that it is
most suitable to consider only general recursive operators $\Theta$, which map every
total function to a total one [16]. This notion is among all notions of robust
EX-learning the most general one in the sense that every class $S$ which is
robustly EX-learnable with respect to any criterion considered in the literature
is also robustly EX-learnable with respect to the model of Jain, Smith and
Wiehagen [16].

Although the class B is quite natural and does not have any obvious
self-referential coding, the class B is not robustly EX-learnable. So while on the
one hand the notion of robust EX-learning still permits topological coding tricks,
it does on the other hand already rule out the natural class B. The
provided example gives some evidence that there is still some need to find an
adequate notion for a “natural EX-learnable class.”

Every class in NUM is robustly EX-learnable, in particular the class U(A)
for a recursive set A (cf. Theorem 11). The next theorem shows that U(A)
is not robustly EX-learnable for any nonrecursive sets A which are part of a
recursively inseparable pair, which are simple but not hypersimple or which are
neither recursive nor high. Thus, here the situation is parallel to the one at
Theorem 13.

Theorem 17. U(A) is not robustly EX-learnable for the following r.e. sets A.

(a) A is part of a recursively inseparable pair.

(b) A is simple but not hypersimple.

(c) A is neither recursive nor high.

7. Conclusions 

The main topic of the present investigation has been the class $B$ of Blum and
Blum and the natural generalizations $U(A)$ of it obtained by using r.e. sets $A$
as a parameter. It has been shown that for large families of r.e. sets $A$, these
classes $U(A)$ are not in NUM. Furthermore, they can always be EX-learned.
Moreover, for some but not all sets $A$ there is also an $\mathcal{R}$EX-learner. Robust
EX-learning is impossible for all non-recursive sets $A$ that are part of a
recursively inseparable pair, for simple but not hypersimple sets $A$, and for all
sets $A$ that are non-high and non-recursive. Since the classes $U(A)$ are quite
natural, this result adds some evidence that “natural learnability” does not
coincide with robust learnability as defined in the current research.

Future work might address the remaining unsolved question of whether $U(A)$
is outside NUM for all non-recursive sets $A$. Additionally, one might investigate
whether $U(A)$ is robustly BC-learnable for some sets $A$ such that $U(A)$ is
not robustly EX-inferable. It would also be interesting to know whether or not
$U(A)$ can be reliably BC-learned for sets $A$ with $U(A) \notin \mathcal{R}$EX (cf. [?] for
more information concerning reliable BC-learning). Finally, there are some ways
to generalize the notion of $U(A)$ to every $K$-recursive set $A$, and one might
investigate the learning-theoretic properties of the classes so obtained.


Abstract. Functional dependencies play an important role in the design
of databases. We study the learnability of the class of minimal covers of
functional dependencies (MCFD) within the exact learning model via
queries. We prove that neither equivalence queries alone nor membership
queries alone suffice to learn the class. In contrast, we show that learning
becomes feasible if both types of queries are allowed. We also give some
properties concerning minimal covers.



1 Introduction 

Functional dependencies were introduced by Codd [?] as a tool for designing
relational databases. Based on this concept, a well developed formalism has 
arisen, the theory of normalization. This formalism helps to build relational 
databases that lack undesirable features, such as redundancy in the data and 
update anomalies. 

We study the learnability of the class MCFD (minimal covers of functional 
dependencies) in the model of learning with queries due to Angluin [?]. In this
model the learner’s goal is to identify an unknown target concept c in some class 
C. In order to obtain information about the target, the learner has available two 
types of queries: membership and equivalence queries. In a membership query 
the learner supplies an instance x from the domain and gets answer YES if x 
belongs to the target, and NO otherwise. The input to an equivalence query is 
some hypothesis $h$, and the answer is either YES if $h = c$ or a counterexample
in the symmetric difference of c and h. 

The class C is learnable if the learner can identify any target concept c in 
time polynomial in the size of c and the length of the largest counterexample 
received. 

* Partially supported by the Spanish DGICYT through project PB95-0787 (KOALA).

We prove that neither equivalence queries alone nor membership queries alone
suffice to learn MCFD. For these negative results we use techniques similar to
those in [?].

On the other hand, we show that MCFD is learnable using both types of
queries. Our algorithm is a modification of Angluin et al.’s algorithm [?] for
learning conjunctions of Horn clauses. We also show that the sizes of equivalent
minimal covers of functional dependencies are polynomially related.

Some related work can be found in [?], where the authors study how prior
knowledge can speed up the task of learning. They propose functional
dependencies (“determinations” in their terminology) as a form of prior knowledge.
They pose the question of whether prior knowledge can be learned. This paper
investigates that direction.

The paper is organized as follows. In Section 2 we introduce definitions
related to functional dependencies, and some algorithms that are folk knowledge.
We need them to prove some properties that will be used throughout the
paper. In Sections 3 and 4 we prove negative results for membership queries and
equivalence queries respectively. Finally, Section 5 shows the learning algorithm
using membership and equivalence queries.

2 Preliminaries 

In what follows, we give some definitions, properties and algorithms related to
functional dependencies, most of which can be found in any database textbook
(see [?]). For definitions concerning the model of learning via queries we refer
the reader to [?].

2.1 Functional Dependencies and Minimal Covers

A relation scheme $R = \{A_1, A_2, \ldots, A_n\}$ is a set of attributes. Each attribute
$A_i$ takes values from domain $DOM(A_i)$. An instance $r$ of relation scheme $R$ is
a subset of $DOM(A_1) \times DOM(A_2) \times \cdots \times DOM(A_n)$. The size of an instance
$r$ is the number of $n$-tuples of $r$.

Given $R$ and $X$, $Y$ subsets of $R$, the functional dependency $X \to Y$ is a
constraint on the values that instances of $R$ can take. More precisely, we say
$X \to Y$, read “$X$ functionally determines $Y$”, if for every instance $r$ of $R$, and
for every pair $(t_1, t_2)$ of tuples of $r$, $t_1(X) = t_2(X) \Rightarrow t_1(Y) = t_2(Y)$ (where
$t_1(Z) = t_2(Z)$ means that tuples $t_1$ and $t_2$ coincide in the value of all attributes
in $Z$). Given a functional dependency $X \to Y$, we call $X$ the antecedent of the
functional dependency and $Y$ the consequent.
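
The pairwise condition in this definition can be checked mechanically on a single
instance. The following is a minimal sketch, assuming tuples are represented as
Python dictionaries keyed by attribute name; the function name `satisfies` and the
sample instance are illustrative and not part of the original text. Note that it
only tests one given instance, whereas the dependency itself is a constraint on
all instances of the scheme.

```python
from itertools import combinations

def satisfies(instance, antecedent, consequent):
    """True if every pair of tuples agreeing on `antecedent` also agrees on `consequent`."""
    for t1, t2 in combinations(instance, 2):
        agree_x = all(t1[a] == t2[a] for a in antecedent)
        agree_y = all(t1[a] == t2[a] for a in consequent)
        if agree_x and not agree_y:
            return False  # this pair violates the dependency
    return True

# Hypothetical instance over the scheme {emp, dept, floor}.
r = [
    {"emp": "ann", "dept": "db", "floor": 2},
    {"emp": "bob", "dept": "db", "floor": 2},
    {"emp": "eve", "dept": "ml", "floor": 2},
]
```

Here `satisfies(r, {"dept"}, {"floor"})` holds on this instance, while
`satisfies(r, {"floor"}, {"dept"})` does not, because two tuples share a floor
but differ in department.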

We say that a functional dependency $X \to Y$ is logically implied by a set of
functional dependencies $F$ if every instance $r$ of $R$ that satisfies the dependencies
in $F$ also satisfies $X \to Y$. If $r$ does not satisfy $X \to Y$, then we say that $r$
violates $X \to Y$.

Definition 1. The closure of a set of dependencies $F$, denoted by $F^+$, is the
set of functional dependencies that are logically implied by $F$.

Definition 2. Let $F$ and $G$ be sets of dependencies. $F$ is equivalent to $G$
($F \equiv G$) if $F^+ = G^+$.

Definition 3. The closure of a set of attributes $X$, written $X^+$, with respect to
a set of dependencies $F$, is the set of attributes $A$ such that $X \to A$ is in $F^+$.

Given a relation scheme $R$, $X \subseteq R$ and a set of functional dependencies $F$, the
following algorithm (see [?]) computes $X^+$ with respect to $F$, in time polynomial
in $|R|$ and $|F|$.

Algorithm Closure

input X;
X+ := X;
repeat
    OLDX+ := X+;
    for each dependency V -> W in F do
        if V ⊆ X+ then X+ := X+ ∪ W;
        end if
    end for
until OLDX+ = X+;



It is easy to test whether two sets of dependencies $F$ and $F'$ are equivalent: for
each dependency $X \to Y$ in $F$ ($F'$), test whether $X \to Y$ is in $F'^+$ ($F^+$), using
the above algorithm to compute $X^+$ with respect to $F'$ ($F$) and then checking
whether $Y \subseteq X^+$. We will use this test in Subsection 2.2 to prove some properties
concerning minimal covers.
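
Algorithm Closure and the equivalence test described above can be sketched as
follows. This is an illustrative rendering, not the authors' code: a dependency
is assumed to be a pair of frozensets `(antecedent, consequent)`, and the names
`closure`, `implies`, `equivalent` and `fd` are chosen here for convenience.

```python
def closure(attrs, fds):
    """Attribute closure X+ of `attrs` with respect to the dependencies `fds`."""
    result = set(attrs)
    changed = True
    while changed:  # corresponds to repeat ... until OLDX+ = X+
        changed = False
        for ante, cons in fds:
            if ante <= result and not cons <= result:
                result |= cons
                changed = True
    return result

def implies(fds, ante, cons):
    """True if ante -> cons is in the closure of `fds`."""
    return set(cons) <= closure(ante, fds)

def equivalent(f, g):
    """True if every dependency of f follows from g and vice versa."""
    return (all(implies(g, a, c) for a, c in f)
            and all(implies(f, a, c) for a, c in g))

def fd(ante, cons):
    """Helper: build a dependency from strings of one-letter attribute names."""
    return (frozenset(ante), frozenset(cons))

F = [fd("A", "B"), fd("B", "C")]
G = [fd("A", "B"), fd("B", "C"), fd("A", "C")]
H = [fd("A", "B")]
```

With these sample sets, `F` and `G` are equivalent because `A -> C` already
follows from `F`, whereas `H` is not equivalent to `F`.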

Definition 4. Let $F$ be a set of dependencies. A set of dependencies $G$ is a
minimal cover for $F$ if:

1. $G \equiv F$.

2. The consequent of each dependency in $G$ is a single attribute.

3. For no dependency $X \to A$ in $G$, $G - \{X \to A\} \equiv F$.

4. For no dependency $X \to A$ in $G$ and proper subset $Y$ of $X$,

$(G - \{X \to A\}) \cup \{Y \to A\} \equiv F$.

We outline a procedure to find a minimal cover (there can be several) for a given
set of dependencies $F$ (see [?] for more details). First, using the property that
a functional dependency $X \to Y$ holds if and only if $X \to A$ holds for all $A$
in $Y$, we decompose all dependencies in $F$ so that condition 2 of the definition is
fulfilled. Then, for conditions 3 and 4, we check repeatedly whether dropping a
dependency (or some attribute in the antecedent of a dependency) from $F$ yields
a set $F'$ equivalent to $F$. If so, we substitute $F'$ for $F$, and keep applying
the procedure until neither dependencies nor attributes can be eliminated.

Note that it follows from the above that the size of a minimal cover for $F$
is never much bigger than the size of $F$ itself (at most a multiplicative factor in
the number of attributes), which makes learning minimal covers as interesting as
learning sets of general functional dependencies.
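
The procedure outlined above can be sketched in a few lines. The version below
is one possible ordering of the steps (split consequents, then shrink
antecedents, then drop redundant dependencies), written under the assumption
that input dependencies are pairs of attribute sets; the function names are
illustrative. Since several minimal covers may exist, the result depends on the
order in which candidates are examined.

```python
def closure(attrs, fds):
    """Attribute closure under dependencies with single-attribute consequents."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for ante, cons in fds:
            if ante <= result and cons not in result:
                result.add(cons)
                changed = True
    return result

def minimal_cover(fds):
    """fds: iterable of (antecedent set, consequent set); returns [(frozenset, attr)]."""
    # Condition 2: one attribute per consequent.
    cover = []
    for ante, cons in fds:
        for a in sorted(cons):
            d = (frozenset(ante), a)
            if d not in cover:
                cover.append(d)
    # Condition 4: remove attributes that are not needed in an antecedent.
    for i in range(len(cover)):
        ante, a = cover[i]
        for b in sorted(ante):
            reduced = ante - {b}
            if a in closure(reduced, cover):
                ante = reduced
                cover[i] = (ante, a)
    cover = list(dict.fromkeys(cover))  # shrinking may create duplicates
    # Condition 3: remove dependencies implied by the remaining ones.
    for d in list(cover):
        rest = [e for e in cover if e != d]
        if d[1] in closure(d[0], rest):
            cover = rest
    return cover

F = [(set("A"), set("BC")), (set("B"), set("C")), (set("AB"), set("C"))]
```

For the sample set `F`, the sketch returns the two dependencies `A -> B` and
`B -> C`.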

2.2 Some Properties of Minimal Covers 

Now we prove that the sizes of equivalent minimal covers are polynomially related.
First we need a lemma.

Lemma 1. Let $R$ be a relation scheme, let $F$ and $F'$ be minimal covers of
functional dependencies over $R$. If $F \equiv F'$ then any dependency $X \to A$ in $F$
can be inferred from $F'$ using at most $|R|$ dependencies of $F'$.

Proof. Let us assume, by way of contradiction, that more than $|R|$
dependencies of $F'$ are needed to infer $X \to A$. Then at least two of them, say $D_1$
and $D_2$, must have the same consequent. If we run Algorithm Closure to
compute $X^+$, the last one of $D_1$ and $D_2$ examined by the algorithm does not force
the inclusion of any new attribute into $X^+$, and thus is unnecessary. □

Corollary 1. Let $R$ be a relation scheme, let $F$ and $F'$ be minimal covers of
functional dependencies over $R$. If $F \equiv F'$ then $|F'| \le |R| \cdot |F|$.

Proof. Suppose, by contradiction, that $|F'| > |R| \cdot |F|$. Then there is some
dependency $D \in F'$ that, by Lemma 1, is not used to infer any of the
dependencies in $F$. Let $G$ be $F' - \{D\}$. Clearly $F^+$ can be inferred from $G$, and since
$F^+ = (F')^+$, $D$ is redundant in $F'$, that is, $F'$ is not a minimal cover. □

In Sections 3 and 4 we define some target classes containing sets of
dependencies that must be inequivalent minimal covers. The following lemmas
will allow us to ensure such requirements.

Lemma 2. Let $F$ be a set of functional dependencies over $R = \{A_1, \ldots, A_n, B\}$
such that the consequent of each dependency in $F$ is the attribute $B$. If for all
$X \to B$ in $F$ it holds that $X$ does not contain the antecedent of any other
dependency of $F$, then $F$ is a minimal cover for $F$.

Proof. Obviously $F$ satisfies conditions 1 and 2 of Definition 4. To check
that $F$ satisfies conditions 3 and 4, note that for every dependency $X \to B$ in
$F$ and for every proper subset $Y$ of $X$, neither $X^+$ with respect to $F - \{X \to B\}$
nor $Y^+$ with respect to $F$ contains $B$. □

Lemma 3. Let $F_1$ and $F_2$ be sets of dependencies whose consequents are the
single attribute $B$. If there exists a dependency $Y \to B$ in $F_1$ such that $Y$
does not contain the antecedent of any dependency of $F_2$, then the two sets are
inequivalent.

Proof. Let $Y \to B$ be the dependency of the hypothesis. As $B \notin Y^+$ with
respect to $F_2$, $F_1 \not\equiv F_2$. □

3 Membership Queries

To show that MCFD is not learnable using membership queries alone, a standard
adversary technique is used. We define a large target class, from which the
target cover will be selected, in such a way that the answer to any membership
query eliminates few elements from the target class. This will force the learner
to make a superpolynomial number of queries to identify the target cover.

Theorem 1. The class MCFD cannot be learned using a polynomial number
of polynomially-sized membership queries.

Proof. Let $R = \{A_1, A_2, \ldots, A_n, B\}$, for $n$ even, be a relation scheme and
let $p$ be any polynomial. The target class $T_n$ will contain $2^{n/2}$ covers, all of
them having the dependencies

$A_1 A_2 \to B, \; A_3 A_4 \to B, \; \ldots, \; A_{n-1} A_n \to B$.

Besides, each cover contains a distinguished dependency, whose antecedent has
one attribute picked from the antecedent of each dependency above, and $B$
as the consequent. By Lemma 2 all covers in the target class are minimal,
and by Lemma 3 they are logically inequivalent.
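
For small even $n$ the target class of this construction can be enumerated
directly, which makes the count of covers and the hypothesis of Lemma 2 easy to
inspect. The sketch below is illustrative only; the attribute naming `A1, ..., An`
as strings and both function names are assumptions made here.

```python
from itertools import product

def target_class(n):
    """All covers of the Theorem 1 construction over A1..An and B (n even)."""
    pairs = [(f"A{2 * i + 1}", f"A{2 * i + 2}") for i in range(n // 2)]
    base = [(frozenset(p), "B") for p in pairs]
    covers = []
    for choice in product(*pairs):  # one attribute from each pair
        covers.append(base + [(frozenset(choice), "B")])
    return covers

def lemma2_hypothesis(cover):
    """No antecedent contains the antecedent of another dependency."""
    return all(
        not cover[i][0] <= cover[j][0]
        for i in range(len(cover))
        for j in range(len(cover))
        if i != j
    )
```

For `n = 6` this yields 8 covers, each with four dependencies, and every cover
passes the Lemma 2 check. For `n = 2` the check fails, since the distinguished
antecedent is then a single attribute contained in `A1 A2`.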

Using $T_n$ as the target class, suppose the learner makes a membership query
with instance $r$ of size at most $p(n)$. The adversary considers every pair of tuples
$(t_1, t_2)$ in $r$, and answers according to the following rule:

- If $t_1(B) = t_2(B)$ for every pair $(t_1, t_2)$, then the answer is YES. No cover is
eliminated from $T_n$.

- Otherwise, let $S$ be the set of all pairs $(t_1, t_2)$ such that $t_1(B) \ne t_2(B)$:

• If for some $(t_1, t_2) \in S$ there exist attributes $A_{2i+1} A_{2(i+1)}$
($0 \le i \le n/2 - 1$) such that $t_1(A_{2i+1} A_{2(i+1)}) = t_2(A_{2i+1} A_{2(i+1)})$, then the
answer is NO. Again no cover is eliminated from $T_n$.

• If none of the conditions above holds then the answer is YES. Note that
this answer removes [...] covers from $T_n$ in the worst case, the case being
that $r$ can be partitioned into pairs $(t_1, t_2)$, where $t_1$ and $t_2$ coincide in
the value of exactly $n/2$ attributes.

Therefore, to identify the target cover the learner must make at least [...] $- 1$
membership queries. □

4 Equivalence Queries 

As in the case of membership queries, to prove nonlearnability with equivalence 
queries alone we use an adversary argument. First, we need a lemma. 

Lemma 4. Let $F$ be a minimal cover defined over $R = \{A_1, A_2, \ldots, A_n, B\}$,
containing $p$ dependencies, each of them having consequent $B$ and an antecedent
with at least $\sqrt{n}$ attributes from $\{A_1, A_2, \ldots, A_n\}$. There exists some instance $r$
of $R$ with the following properties:

- $r$ has two tuples $(t_1, t_2)$.

- $r$ satisfies $F$.

- The number of attributes $A_i \in \{A_1, A_2, \ldots, A_n\}$ for which $t_1(A_i) \ne t_2(A_i)$ is
at most $1 + \sqrt{n}(\ln p)$.

Proof. Since all antecedents of dependencies in $F$ have at least $\sqrt{n}$ attributes
from $\{A_1, A_2, \ldots, A_n\}$, there must be some attribute $A_i$ that occurs in at least
$p/\sqrt{n}$ of them. We can now delete from $F$ the dependencies that have $A_i$ in their
antecedent, and apply the same procedure to the remaining set of dependencies.
After doing so $k$ times we are left with a set of at most $p(1 - 1/\sqrt{n})^k$ dependencies.
Taking $k = 1 + \sqrt{n}(\ln p)$, we obtain $p(1 - 1/\sqrt{n})^k < 1$. Therefore, there is a set $X$
with at most $1 + \sqrt{n}(\ln p)$ attributes such that all the antecedents of dependencies
in $F$ have some attribute in $X$.

The instance $r = (t_1, t_2)$ where $t_1(A) \ne t_2(A)$ for all $A \in X$ and $t_1(R - X) =
t_2(R - X)$ surely satisfies $F$. □
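
The counting argument in this proof is a greedy covering procedure: repeatedly
pick the attribute that occurs in the most remaining antecedents and discard the
antecedents it hits. A minimal sketch follows, assuming antecedents are given as
non-empty sets of attribute names; the function names, the tie-breaking rule and
the sample data are illustrative assumptions.

```python
import math

def greedy_hitting_set(antecedents):
    """Pick attributes until every antecedent contains a chosen attribute."""
    remaining = [set(a) for a in antecedents]
    chosen = []
    while remaining:
        counts = {}
        for ante in remaining:
            for attr in ante:
                counts[attr] = counts.get(attr, 0) + 1
        # Most frequent attribute; ties broken by name for determinism.
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        remaining = [ante for ante in remaining if best not in ante]
    return chosen

def disagreeing_instance(attributes, x):
    """Two tuples that differ exactly on the attributes in x."""
    t1 = {a: 0 for a in attributes}
    t2 = {a: (1 if a in x else 0) for a in attributes}
    return t1, t2

# Hypothetical cover with n = 9 and p = 4 antecedents of size sqrt(9) = 3.
antecedents = [
    {"A1", "A2", "A3"}, {"A1", "A4", "A5"},
    {"A2", "A4", "A6"}, {"A7", "A8", "A9"},
]
X = greedy_hitting_set(antecedents)
bound = 1 + math.sqrt(9) * math.log(len(antecedents))
t1, t2 = disagreeing_instance([f"A{i}" for i in range(1, 10)] + ["B"], X)
```

In this example the greedy choice stays within the bound $1 + \sqrt{n}(\ln p)$ of
the lemma, and the two tuples disagree on at least one attribute of every
antecedent, so no dependency with one of these antecedents is violated.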

Theorem 2. The class MCFD cannot be learned using a polynomial number
of polynomially-sized equivalence queries.

Proof. Let $R = \{A_1, A_2, \ldots, A_n, B\}$ be a relation scheme. We define the
target class, $T_n$, that contains every cover $G$ that satisfies the following:

- $G$ has $\sqrt{n}$ dependencies of the form $A_{i_1} A_{i_2} \cdots A_{i_{\sqrt{n}}} \to B$.

- The antecedents of the dependencies in $G$ are pairwise disjoint.

By Lemma 2 and Lemma 3 all covers in $T_n$ are minimal and logically inequivalent.
The cardinality of $T_n$ is

$|T_n| = \frac{n!}{((\sqrt{n})!)^{\sqrt{n}} \, (\sqrt{n})!}$

Now, to an equivalence query on input $F$, of size at most $p = n^c$, where $c$ is a
constant, the adversary answers as follows:

- If there is some dependency $X \to A_i$ in $F$, where $A_i \in \{A_1, A_2, \ldots, A_n\}$,
then give as a counterexample the instance $r = (t_1, t_2)$ where $t_1(A_i) \ne t_2(A_i)$
and $t_1(R - \{A_i\}) = t_2(R - \{A_i\})$. Clearly this counterexample does not
remove any cover from $T_n$.

- Otherwise, if there is some dependency $X \to B$ in $F$ and $|X| < \sqrt{n}$, then
return as a counterexample the instance $r = (t_1, t_2)$ where $t_1(X) = t_2(X)$
and $t_1(A) \ne t_2(A)$ for all $A \in R - X$. No cover in $T_n$ is violated by this
instance, although it violates $F$.

- If none of the cases above holds then Lemma 4 guarantees the existence of
an instance $r = (t_1, t_2)$ that satisfies $F$, and whose tuples disagree in the
values of at most $1 + c\sqrt{n}(\ln n)$ attributes. In this case give that instance as
counterexample. The covers that will be eliminated by the counterexample
are those for which the antecedent of every dependency has at least one
attribute in which the tuples of $r$ disagree. Therefore, the number of covers
that $r$ eliminates from $T_n$ is at most [...]

The fraction of covers removed from $T_n$ is at most

$(1 + c\sqrt{n}(\ln n))^{\sqrt{n}}$ [...] $(\sqrt{n})!$ [...]

which is superpolynomially small in $n$. □

5 The Learning Algorithm

In this section we show that a slight modification of Angluin et al.’s algorithm
HORN for learning conjunctions of Horn clauses, using membership and
equivalence queries, yields an algorithm that learns MCFD.

First, we discuss the meaning of positive and negative counterexamples in the
setting of functional dependencies. Let us assume that the counterexamples are
instances of two tuples (obviously, a one-tuple instance can never be a
counterexample). In this case, a positive counterexample $(t_1, t_2)$ tells us that no
dependency having its antecedent contained in the set of attributes where $t_1$ and
$t_2$ agree, and its consequent outside, can be in the target cover. In contrast, a
negative counterexample indicates that at least one dependency satisfying the
conditions just mentioned must be in the target.

Note that the significance of these counterexamples is the same as the
meaning of counterexamples in the case of Horn clauses, if we translate “set of
attributes where $t_1$ and $t_2$ agree” into “set of variables assigned true”. Also
note that there is no syntactic difference between a conjunction of Horn clauses
and a minimal cover for a set of functional dependencies,¹ hence the input to
equivalence queries has the same “shape”, no matter what oracle (EQ_HORN or
EQ_MCFD) we use.

¹ There is an exception to this statement. The counterpart to a Horn clause $c$ with no
positive literal should be a functional dependency $f$ with the empty set as consequent.
The difference is not merely syntactic but also semantic, since $f$ does not impose
any constraint on the instance space, that is, $f$ is superfluous, unlike $c$. This fact
rules out the straightforward transformation of the target class used by Angluin [3]
to prove approximate fingerprints for CNF (and implicitly for conjunctions of Horn
clauses) into a target class for proving non-learnability with equivalence queries alone
for MCFD. The reason is that, once transformed, the class would contain just one
minimal cover: the empty one. Also, the counterpart to a Horn clause $c$ with no
negative literals should be a functional dependency $f$ with the empty set in the
antecedent. However, in this case the meaning of both $c$ and $f$ is alike.

Therefore, were we to learn MCFD over an instance space containing only
two-tupled examples, the transformation of HORN would be straightforward:
substitute EQ_MCFD and MQ_MCFD for EQ_HORN and MQ_HORN respectively;
whenever EQ_MCFD provides a counterexample $(t_1, t_2)$, convert it into a
boolean vector by setting to true the attributes (variables) where $t_1$ and $t_2$
agree and false elsewhere; finally, perform the reverse mapping before asking
any membership query. One last remark: if we wanted the learning algorithm to
be proper, in the sense that inputs to equivalence queries be in the class MCFD,
we should transform the hypotheses generated into minimal covers. This can be
done in polynomial time.

Now, we wish to address the problem of learning in the general case, that is,
when the instance space is not restricted to contain only two-tupled instances.
The key observation is that, to detect the violation of some dependency or the
need for its inclusion in the current hypothesis, it suffices to consider pairs of
tuples. When a $k$-tupled positive counterexample is received, we consider all
$\binom{k}{2}$ pairs of tuples, and for each of them proceed to remove from the current
hypothesis the dependencies that are violated. If the counterexample is negative
then we ask $\binom{k}{2}$ membership queries to detect a pair of tuples (there must be
at least one) that violates some dependency in the target cover, and proceed
accordingly, that is, trying to identify the dependency in order to include it in
the current hypothesis. Thus, we have reduced the problem of learning MCFD
over an unrestricted instance space to that of learning when the instances have
two tuples.
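
The reduction for a negative counterexample amounts to a scan over pairs of
tuples with one membership query per pair. The following sketch assumes tuples
are dictionaries and simulates the membership oracle from a known target; the
names `make_mq` and `find_violating_pair` and the sample data are hypothetical.

```python
from itertools import combinations

def make_mq(target):
    """Simulated membership oracle: does the instance satisfy every target dependency?"""
    def mq(instance):
        return all(
            any(t1[a] != t2[a] for a in ante) or t1[cons] == t2[cons]
            for t1, t2 in combinations(instance, 2)
            for ante, cons in target
        )
    return mq

def find_violating_pair(instance, mq):
    """Ask one query per pair; return the first pair the target rejects, else None."""
    for t1, t2 in combinations(instance, 2):
        if not mq([t1, t2]):
            return t1, t2
    return None

target = [(frozenset({"A"}), "B")]  # hypothetical target cover: A -> B
mq = make_mq(target)
r = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 2, "B": 1, "C": 1},
    {"A": 1, "B": 2, "C": 3},
]
pair = find_violating_pair(r, mq)
```

Here the first and third tuples agree on `A` but not on `B`, so they form the
violating pair; at most $\binom{k}{2}$ queries are needed for a $k$-tuple instance.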

We now present the algorithm that learns MCFD. We follow the notation
in [?] as much as possible. For boolean vectors $x$ and $y$, $true(x)$ is the set of
attributes assigned true by $x$; $x \cap y$ is the boolean vector such that
$true(x \cap y) = true(x) \cap true(y)$. Given a two-tupled relation $r$, $sketch(r)$ is the
boolean vector whose true values correspond to the attributes on which the tuples
of $r$ agree. For a boolean vector $x$, $rel(x)$ maps $x$ onto a relation $r$ (there are
many) such that $sketch(r) = x$. Finally, if $x$ is a boolean vector such that
$true(x) = \{A_1, A_2, \ldots, A_k\}$, then $FD(x)$ denotes the set of functional dependencies

$FD(x) = \{A_1 A_2 \cdots A_k \to B : B \notin true(x)\}$.
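
These operators translate directly into code. The sketch below fixes a small
hypothetical scheme `ATTRS`, represents boolean vectors as tuples of booleans in
attribute order, and picks one particular relation for `rel` (the text notes
that many choices exist); `meet` stands for the intersection $x \cap y$.

```python
ATTRS = ("A", "B", "C", "D")  # hypothetical relation scheme

def sketch(r):
    """Boolean vector marking the attributes on which the two tuples agree."""
    t1, t2 = r
    return tuple(t1[a] == t2[a] for a in ATTRS)

def true(x):
    """Set of attributes assigned true by the vector x."""
    return {a for a, bit in zip(ATTRS, x) if bit}

def meet(x, y):
    """Componentwise conjunction, so that true(meet(x, y)) = true(x) & true(y)."""
    return tuple(p and q for p, q in zip(x, y))

def rel(x):
    """One two-tuple relation whose sketch is x."""
    t1 = {a: 0 for a in ATTRS}
    t2 = {a: (0 if bit else 1) for a, bit in zip(ATTRS, x)}
    return (t1, t2)

def FD(x):
    """Dependencies true(x) -> B for every attribute B outside true(x)."""
    ante = frozenset(true(x))
    return {(ante, b) for b in ATTRS if b not in ante}

x = (True, True, False, False)
y = (True, False, True, False)
```

For the sample vector `x`, `FD(x)` consists of `AB -> C` and `AB -> D`, and
`sketch(rel(x))` returns `x` again.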

The algorithm maintains a sequence $S$ of boolean vectors that are sketches
of negative counterexamples, each of them violating distinct dependencies of the
target cover. This sequence is used to generate a new hypothesis $F$ by taking
the union of $FD(x)$ for all $x$ in $S$. Since we want a proper learning algorithm,
we must transform the hypothesis $F$ thus generated into a minimal cover $G$,
prior to any equivalence query. Note that when a positive counterexample is
received we eliminate dependencies from $F$ instead of $G$. In doing so we preserve
the parallelism with algorithm HORN, where a hypothesis may contain clauses
that are implied by other clauses in the same hypothesis. This is to prevent the
algorithm from possibly entering an infinite loop. On the other hand, it is obvious
that the counterexample provided by an equivalence query is independent of
whether the input to that query is a minimal cover or not, as long as they are
equivalent. (For more explanations and ideas behind the algorithm see [?].)

Set S to be the empty sequence; /* s_i denotes the i-th boolean vector of S */
Set F to be the empty hypothesis;
Set G to be the empty hypothesis; /* G is a minimal cover for F */
while EQ_MCFD(G) ≠ YES loop
    Let r be the counterexample relation returned by the equivalence query;
    if r violates at least one functional dependency of F
    then /* r is a positive example */
        remove from F every dependency that r violates;
    else /* r is a negative example */
        ask (at most (k choose 2)) queries to MQ_MCFD until a
            negative answer is got for some (t1, t2) in r;
        x := sketch((t1, t2));
        for each s_i in S such that true(s_i ∩ x)
                is properly contained in true(s_i) loop
            MQ_MCFD(rel(s_i ∩ x));
        end loop;
        if any of these queries is answered NO
        then
            let i = min {j : MQ_MCFD(rel(s_j ∩ x)) = NO};
            replace s_i with s_i ∩ x;
        else
            add x as the last element in the sequence S;
        end if;
        set F to be the union of FD(s) for all s in S;
    end if;
    set G to be a minimal cover for F;
end loop;
return G;
end;
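
To make the control flow concrete, here is a small simulation of the loop above
under simplifying assumptions that are not in the original: every example is a
two-tuple instance represented directly by its agreement set, both oracles are
simulated by brute force from a known target, and the hypothesis is passed to
the equivalence oracle without first reducing it to a minimal cover. All
function names are illustrative.

```python
from itertools import combinations

def satisfies(agree, fds):
    """A two-tuple instance with agreement set `agree` satisfies all of `fds`."""
    return all(not ante <= agree or cons in agree for ante, cons in fds)

def all_agree_sets(attrs):
    """Every subset of the attributes, smallest first, in a fixed order."""
    items = sorted(attrs)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

def make_oracles(attrs, target):
    """Brute-force membership and equivalence oracles for a known target."""
    def mq(agree):
        return satisfies(agree, target)
    def eq(hyp):
        for x in all_agree_sets(attrs):
            if satisfies(x, hyp) != satisfies(x, target):
                return x  # counterexample
        return None  # hypothesis is equivalent to the target
    return mq, eq

def learn_mcfd(attrs, mq, eq):
    attrs = frozenset(attrs)
    S, F = [], set()
    while True:
        x = eq(F)
        if x is None:
            return F
        violated = {d for d in F if d[0] <= x and d[1] not in x}
        if violated:  # positive counterexample
            F -= violated
        else:  # negative counterexample
            for i, s in enumerate(S):
                m = s & x
                if m < s and not mq(m):
                    S[i] = m  # refine the first vector that can be refined
                    break
            else:
                S.append(x)
            F = {(s, b) for s in S for b in attrs - s}

attrs = {"A", "B", "C", "D"}
target = {(frozenset("A"), "B"), (frozenset("BC"), "D")}
mq, eq = make_oracles(attrs, target)
learned = learn_mcfd(attrs, mq, eq)
```

On this sample target the simulation stops with a hypothesis that the simulated
equivalence oracle accepts. The brute-force oracle is exponential in the number
of attributes and is only meant for experiments on tiny schemes.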

The correctness of the algorithm follows from the correctness of HORN and
the comments above. Regarding query and time complexity, the algorithm makes
as many equivalence queries as HORN does. However, both the number of
membership queries and the time complexity are increased, since the
counterexamples can have an arbitrary number of tuples, and for each counterexample
received the algorithm has to compute a minimal cover. In any case, the
complexity is polynomial in the size of the target cover, the number of attributes of
the relation scheme and the number of tuples of the largest counterexample.


Abstract. Boolean formulas are known not to be PAC-predictable even
with membership queries under some cryptographic assumptions. In this
paper, we study the learning complexity of some subclasses of boolean
formulas obtained by varying the basis of elementary operations
allowed as connectives. This broad family of classes includes, as a
particular case, general boolean formulas, by considering the basis given by
{AND, OR, NOT}. We completely solve the problem. We prove the
following dichotomy theorem: For any set of basic boolean functions, the
resulting set of formulas is either polynomially learnable from equivalence
queries or membership queries alone or else it is not PAC-predictable even
with membership queries under cryptographic assumptions. We identify
precisely which sets of basic functions are in which of the two cases.
Furthermore, we prove that the learning complexity of formulas over a basis
depends only on the absolute expressivity power of the class, i.e., the set
of functions that can be represented regardless of the size of the
representation. In consequence, the same classification holds for the learnability
of boolean circuits.



1 Introduction 

The problem of learning an unknown boolean formula under some determined
protocol has been widely studied. It is well known that, even restricted to
propositional formulas, the problem is hard [?] in the usual learning models.
Therefore researchers have attempted to learn subclasses of propositional boolean
formulas obtained by enforcing some restrictions on the structure of the
formula, especially subclasses of boolean formulas in disjunctive normal form (DNF).
For example, $k$-DNF formulas, $k$-term DNF formulas, monotone-DNF formulas,
Horn formulas, and their dual counterparts [?] have all been shown
exactly learnable using membership and equivalence queries in Angluin’s model,
while the question of whether DNF formulas are learnable is still open. Another
important class of problems can be obtained by restricting the number of
occurrences of a variable. For example, whereas there is a polynomial-time algorithm
to learn read-once formulas with equivalence and membership queries [2], the
problem of learning read-thrice boolean formulas is hard under cryptographic
assumptions [?].

In this paper we take a different approach. We study the complexity of
learning subclasses of boolean formulas obtained by placing some restrictions on the
elementary boolean functions that can be used to build the formulas. In general,
boolean formulas are constructed by using elementary functions from a complete
basis, generally {AND, OR, NOT}. In this paper we will allow formulas to use
as a basis any arbitrary set of boolean functions.

More precisely, let $F = \{f_1, \ldots, f_m\}$ be a finite set of boolean functions. A
formula in FOR($F$) can be any of (a) a boolean variable, or (b) an expression
of the form $f(g_1, \ldots, g_k)$ where $f$ is a $k$-ary function in $F$ and $g_1, \ldots, g_k$ are
formulas in FOR($F$).
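
The inductive definition maps onto a short recursive evaluator. In the sketch
below a formula is assumed to be either a variable name (a string) or a tuple
whose first entry names a basis function; this encoding, the sample basis and
the function name `evaluate` are illustrative choices, not notation from the
paper.

```python
# Hypothetical basis: the monotone connectives on values 0 and 1.
BASIS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
}

def evaluate(formula, assignment, basis=BASIS):
    """Evaluate a formula over `basis` under a 0/1 assignment of its variables."""
    if isinstance(formula, str):  # case (a): a boolean variable
        return assignment[formula]
    name, *subformulas = formula  # case (b): f(g_1, ..., g_k)
    args = [evaluate(g, assignment, basis) for g in subformulas]
    return basis[name](*args)

phi = ("OR", ("AND", "x", "y"), "z")  # (x AND y) OR z
```

For instance, `phi` evaluates to 1 when `x` and `y` are both 1, and to 0 when
only `x` is 1.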

For example, consider the problem of learning a monotone boolean formula.
Every such formula can be expressed as a formula in the class FOR({AND, OR}).
The main result of this paper characterizes the complexity of learning FOR($F$)
for every finite set $F$ of boolean functions. The most striking feature of this
characterization is that for any $F$, FOR($F$) is either polynomially learnable
with equivalence or membership queries alone or, under some cryptographic
assumptions, not polynomially predictable even with membership queries.

This dichotomy is somewhat surprising since one might expect that any such 
large and diverse family of concept classes would include some representatives of 
the many intermediate learning models such as exact learning with equivalence 
and membership queries, PAC learning with and without membership queries 
and PAC-prediction without membership queries. 

Furthermore, we give an interesting classification of the polynomially
learnable classes. We show that, in a sense that will be made precise later, FOR($F$)
is polynomially learnable if and only if at least one of the following conditions
holds:

(a) Every function $f(x_1, x_2, \ldots, x_n)$ in $F$ is definable by an expression of the
form $c_0 \lor (c_1 \land x_1) \lor (c_2 \land x_2) \lor \cdots \lor (c_n \land x_n)$ for some boolean coefficients
$c_i$ ($1 \le i \le n$).

(b) Every function $f(x_1, x_2, \ldots, x_n)$ in $F$ is definable by an expression of the
form $c_0 \land (c_1 \lor x_1) \land (c_2 \lor x_2) \land \cdots \land (c_n \lor x_n)$ for some boolean coefficients
$c_i$ ($1 \le i \le n$).

(c) Every function $f(x_1, x_2, \ldots, x_n)$ in $F$ is definable by an expression of the
form $c_0 \oplus (c_1 \land x_1) \oplus (c_2 \land x_2) \oplus \cdots \oplus (c_n \land x_n)$ for some boolean coefficients
$c_i$ ($1 \le i \le n$).
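
For a concrete basis, the three conditions can be tested by brute force over
truth tables. The sketch below assumes each basis function is a Python callable
on 0/1 arguments together with its arity; the coefficients are read off from the
values at the all-zero input and at the unit inputs, and condition (b) is tested
through the dual function. Function names are illustrative.

```python
from itertools import product

def _unit(i, n):
    return tuple(int(j == i) for j in range(n))

def is_disjunctive(f, n):
    """f equals c0 OR (c1 AND x1) OR ... OR (cn AND xn) for some coefficients."""
    c0 = f(*([0] * n))
    c = [f(*_unit(i, n)) for i in range(n)]
    return all(
        f(*x) == max([c0] + [ci & xi for ci, xi in zip(c, x)])
        for x in product((0, 1), repeat=n)
    )

def is_conjunctive(f, n):
    """f has form (b) exactly when its dual has form (a)."""
    dual = lambda *x: 1 - f(*(1 - xi for xi in x))
    return is_disjunctive(dual, n)

def is_affine(f, n):
    """f equals c0 XOR (c1 AND x1) XOR ... XOR (cn AND xn) for some coefficients."""
    c0 = f(*([0] * n))
    c = [f(*_unit(i, n)) ^ c0 for i in range(n)]
    def predicted(x):
        value = c0
        for ci, xi in zip(c, x):
            value ^= ci & xi
        return value
    return all(f(*x) == predicted(x) for x in product((0, 1), repeat=n))

def conditions_met(basis):
    """basis: list of (function, arity); returns which of a, b, c hold for all of it."""
    tests = {"a": is_disjunctive, "b": is_conjunctive, "c": is_affine}
    return [k for k, t in tests.items() if all(t(f, n) for f, n in basis)]

AND = lambda a, b: a & b
OR = lambda a, b: a | b
XOR = lambda a, b: a ^ b
NOT = lambda a: 1 - a
```

With these sample functions, the basis {OR} meets condition (a), {AND} meets
(b), {XOR, NOT} meets (c), and {AND, OR} meets none of the three.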

There is another rather special feature of this result. Learnability of boolean 
formulas over a basis F depends only on the set of functions that can be expressed 
as a formula in FOR(E) but it does not depend on the size of the representation. 
As a consequence of this fact, the same dichotomy holds for representation classes 
which, in terms of absolute expressivity power, are equivalent to formulas, such 
as boolean circuits. 

As an intermediate tool for our study we introduce a link with some well
known algebraic structures, called clones in Universal Algebra. In particular,
we make use of a very remarkable result on the structure of boolean functions
proved by Post [?]. This approach has been successful in the study of other
computational problems such as satisfiability, tautology and counting problems
of boolean formulas [?], learnability of quantified formulas [?] and Constraint
Satisfaction Problems [?].

Finally we mention some similar results: a dichotomy result for satisfiability,
tautology and some counting problems over closed sets of boolean functions [?],
the circuit value problem [?], the satisfiability of generalized formulas [24],
the inverse generalized satisfiability problem [?], the generalized satisfiability
counting problem [7], the approximability of minimization and maximization
problems [?], the optimal assignments of Generalized Propositional
Formulas [?], and the learnability of quantified boolean formulas [?].

2 Learning Preliminaries 

Most of the terminology about learning comes from [?]. Strings over $X = \Sigma^*$
will represent both examples and concept names. A representation of concepts $C$
is any subset of $X \times X$. We interpret an element $(u, x)$ of $X \times X$ as consisting of
a concept name $u$ and an example $x$. The example $x$ is a member of the concept
$u$ if and only if $(u, x) \in C$. Define the concept represented by $u$ as
$K_C(u) = \{x : (u, x) \in C\}$. The set of concepts represented by $C$ is $K_C = \{K_C(u) : u \in X\}$.

In this paper we use two models of learning, both of them fairly standard:
Angluin’s model of exact learning with queries [?] and the
model of PAC-prediction with membership queries as defined by Angluin and
Kharitonov [?].

To compare the difficulty of learning problems in the prediction model we use
a slight generalization of the prediction-preserving reducibility with membership
queries [?].

Definition 1. Let $C$ and $C'$ be representations of concepts. Let $\top$ and $\bot$ be
elements not in $X$. Then $C$ is pwm-reducible to $C'$, denoted $C \le_{pwm} C'$, if and
only if there exist four mappings $g$, $f$, $h$, and $j$ with the following properties:

1. There is a nondecreasing polynomial $q$ such that for all natural numbers $s$
and $n$ and for $u \in X$ with $|u| \le s$, $g(s, n, u)$ is a string $u'$ of length at most
$q(s, n, |u|)$.

2. For all natural numbers $s$ and $n$, for every string $u \in X$ with $|u| \le s$, and
for every $x \in X$ with $|x| \le n$, $f(s, n, x)$ is a string $x'$ and $x \in K_C(u)$ if and
only if $x' \in K_{C'}(g(s, n, u))$. Moreover, $f$ is computable in time bounded by a
polynomial in $s$, $n$, and $|x|$, hence there exists a nondecreasing polynomial $t$
such that $|x'| \le t(s, n, |x|)$.

3. For all natural numbers $s$ and $n$, for every string $u \in X$ with $|u| \le s$, for
every $x' \in X$, and for every $b \in \{\top, \bot\}$, $h(s, n, x')$ is a string $x \in X$, and
$j(s, n, x', b)$ is either $\bot$ or $\top$. Furthermore $x' \in K_{C'}(g(s, n, u))$ if and only if
$j(s, n, x', b) = \top$, where $b = \top$ if $x \in K_C(u)$ and $b = \bot$ otherwise. Moreover,
$h$ and $j$ are computable in time bounded by a polynomial in $s$, $n$, and $|x'|$.

In (2), and independently in (3), the expression "$x \in K_C(u)$" can be replaced
with "$x \notin K_C(u)$".

The following results are obtained by slightly adapting existing proofs from the literature.

Lemma 1. The pwm-reduction is transitive, i.e., if $C$, $C'$ and $C''$ are
representations of concepts with $C \le_{pwm} C'$ and $C' \le_{pwm} C''$, then $C \le_{pwm} C''$.

Lemma 2. Let $C$ and $C'$ be representations of concepts. If $C \le_{pwm} C'$ and $C'$
is polynomially predictable with membership queries, then $C$ is also polynomially
predictable with membership queries.



3 Clones 

Let $D$ be a finite set called the domain. An $n$-adic function over $D$ is a map $f : D^n \to D$.
Let $\mathcal{F}_D$ be the set of all the functions over the domain $D$. Let $\mathcal{P}$ be a class
of functions over $D$. Let $\mathcal{P}_n$ be the $n$-adic functions in $\mathcal{P}$. We shall say that $\mathcal{P}$ is
a clone if it satisfies the following conditions:

C1 For each $n \ge m \ge 1$, $\mathcal{P}$ contains the projection function $\mathrm{proj}_{n,m}$ defined by
$\mathrm{proj}_{n,m}(x_1, \ldots, x_n) = x_m$.

C2 For each $n, m \ge 1$, each $f \in \mathcal{P}_n$ and each $g_1, \ldots, g_n \in \mathcal{P}_m$, $\mathcal{P}_m$ contains the
composite function $h = f[g_1, \ldots, g_n]$ defined by

$$h(x_1, \ldots, x_m) = f(g_1(x_1, \ldots, x_m), \ldots, g_n(x_1, \ldots, x_m))$$
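The two clone conditions can be illustrated with a small sketch over the boolean domain. The helper names `proj`, `compose` and `truth_table` are hypothetical, introduced only for this illustration, and the projection is assumed to be the usual one returning its $m$-th argument.

```python
from itertools import product

def proj(n, m):
    """Projection on n arguments returning the m-th one (1-indexed)."""
    def p(*xs):
        assert len(xs) == n
        return xs[m - 1]
    return p

def compose(f, gs):
    """Composite h = f[g_1, ..., g_n]: h(x) = f(g_1(x), ..., g_n(x))."""
    return lambda *xs: f(*(g(*xs) for g in gs))

def truth_table(h, arity, domain=(0, 1)):
    """Values of h on all tuples of the given arity, in lexicographic order."""
    return tuple(h(*xs) for xs in product(domain, repeat=arity))

AND = lambda x, y: x & y
OR = lambda x, y: x | y

# h(x1, x2, x3) = AND(OR(x1, x2), x3), built only from projections and composition.
inner = compose(OR, [proj(3, 1), proj(3, 2)])
h = compose(AND, [inner, proj(3, 3)])
```

Every function obtained this way from AND and OR belongs to the clone generated by {AND, OR}, which is the closure property that C1 and C2 express.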

If $F$ is any set of functions over $D$, there is a smallest clone containing
all of the functions in $F$; this is the clone generated by $F$, and we denote it $\langle F \rangle$.
If $F = \{f_1, \ldots, f_k\}$ is a finite set, we may write $\langle f_1, \ldots, f_k \rangle$ for $\langle F \rangle$, and refer to
$f_1, \ldots, f_k$ as generators of $\langle F \rangle$. The set of clones over a finite domain $D$ is closed
under intersection and therefore it constitutes a lattice, with meet ($\wedge$) and join
($\vee$) operations defined by:

$$\bigwedge_{i \in I} C_i = \bigcap_{i \in I} C_i, \qquad \bigvee_{i \in I} C_i = \Big\langle \bigcup_{i \in I} C_i \Big\rangle$$

There is a smallest clone, which is the intersection of all clones; we shall denote
it $\mathcal{I}_D$. It is easy to see that $\mathcal{I}_D$ contains exactly the projections over $D$.



3.1 Boolean Case 

Operations on a 2-element set, say $D = \{0, 1\}$, are boolean operations. The
lattice of the clones over the boolean domain was studied by Post, leading to a
full description [21]. The proof is too long to be included here. We will give only
the description of the lattice.

A usual way to describe a poset $(P, \le)$ (and a lattice in particular) is by
depicting a diagram of the covering relation, where for every $a, b \in P$ we say
that $a$ covers $b$ if $b < a$ and if $c$ is an element in $P$ such that $b \le c \le a$, then
either $c = a$ or $c = b$. The diagram of the lattice of clones on $D = \{0, 1\}$, often
called Post's lattice, is depicted in Figure 1. The clones are labeled according to
their standard names.

Fig. 1. Post Lattice

A clone $C$ is join irreducible iff $C = C_1 \vee C_2$ always implies $C = C_1$ or
$C = C_2$. In Figure 1 the join irreducible clones of the diagram are marked.
Since $\langle F \rangle = \bigvee_{f \in F} \langle f \rangle$, it follows that the join irreducible clones are generated
by a single operation; furthermore, it suffices to present a generating operation
for each join irreducible clone of Post's lattice. Figure 2 associates to every join
irreducible clone $C$ its generating operation $\varphi_C$.

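For an $n$-ary boolean function $f$ define its dual $\mathrm{dual}(f)$ by

$$\mathrm{dual}(f)(x_1, \ldots, x_n) = \neg f(\neg x_1, \ldots, \neg x_n)$$

Obviously, $\mathrm{dual}(\mathrm{dual}(f)) = f$. Furthermore, $f$ is self-dual iff $\mathrm{dual}(f) = f$.
For a class $F$ of boolean functions define $\mathrm{dual}(F) = \{\mathrm{dual}(f) : f \in F\}$. The
classes $F$ and $\mathrm{dual}(F)$ are called dual. Notice that $\langle \mathrm{dual}(F) \rangle = \mathrm{dual}(\langle F \rangle)$.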
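As a quick illustration of the dual construction, the following sketch computes duals by brute force over truth tables. The names `dual` and `is_self_dual` are illustrative helpers, not notation from the paper.

```python
from itertools import product

def dual(f):
    """dual(f)(x_1, ..., x_n) = NOT f(NOT x_1, ..., NOT x_n), on 0/1 values."""
    return lambda *xs: 1 - f(*(1 - x for x in xs))

def same_function(f, g, arity):
    """Compare two boolean functions on every input of the given arity."""
    return all(f(*xs) == g(*xs) for xs in product((0, 1), repeat=arity))

def is_self_dual(f, arity):
    return same_function(f, dual(f), arity)

AND = lambda x, y: x & y
OR = lambda x, y: x | y
MAJ = lambda x, y, z: (x & y) | (x & z) | (y & z)
```

Here AND and OR are dual to each other, while the majority operation is its own dual, a property used later in the proof for clones containing the majority operation.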

4 The Dichotomy Theorem 

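A base $F$ is a finite set of boolean functions $\{f_1, f_2, \ldots, f_n\}$ ($f_i$, $1 \le i \le n$,
denotes both the function and its symbol). We follow the standard definitions
of boolean circuits and boolean formulas. The class of boolean circuits over the
basis $F$, denoted by CIR($F$), is defined to be the set of all the boolean circuits
where every gate is a function in $F$. The class of boolean formulas over the basis
$F$, denoted by FOR($F$), is the class of circuits in CIR($F$) with fan-out $\le 1$.

Clone                  Generating operation
$D_2$                  $\varphi_{D_2}(x, y, z) = (x \wedge y) \vee (x \wedge z) \vee (y \wedge z)$
$F_1^\infty$           $\varphi_{F_1^\infty}(x, y, z) = x \vee (y \wedge \neg z)$
$F_2^\infty$           $\varphi_{F_2^\infty}(x, y, z) = x \vee (y \wedge z)$
$F_2^i$ $(i \ge 2)$    $\varphi_{F_2^i}(x_1, \ldots, x_{i+1}) = \bigwedge_{j=1}^{i+1} (x_1 \vee \cdots \vee x_{j-1} \vee x_{j+1} \vee \cdots \vee x_{i+1})$
$F_5^\infty$           $\varphi_{F_5^\infty}(x, y, z) = x \wedge (y \vee \neg z)$
$F_6^\infty$           $\varphi_{F_6^\infty}(x, y, z) = x \wedge (y \vee z)$
$F_6^i$ $(i \ge 2)$    $\varphi_{F_6^i}(x_1, \ldots, x_{i+1}) = \bigvee_{j=1}^{i+1} (x_1 \wedge \cdots \wedge x_{j-1} \wedge x_{j+1} \wedge \cdots \wedge x_{i+1})$
$L_4$                  $\varphi_{L_4}(x, y, z) = x \oplus y \oplus z$
$O_1$                  $\emptyset$
$O_4$                  $\varphi_{O_4}(x) = \neg x$
$O_5$                  $\varphi_{O_5}(x) = 1$
$O_6$                  $\varphi_{O_6}(x) = 0$
$P_1$                  $\varphi_{P_1}(x, y) = x \wedge y$
$S_1$                  $\varphi_{S_1}(x, y) = x \vee y$

Fig. 2. Generating operations for meet irreducible clones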

Given a boolean circuit $C$ over the input variables $x_1, x_2, \ldots, x_n$, we denote
by $[C]$ the function computed by $C$ when the variables are taken as arguments
in lexicographical order. In the same way, given a finite set of boolean
functions $F$ we define [CIR($F$)] as the class of functions computed by circuits in
CIR($F$). Similarly, given a boolean formula $\Phi$ over the variables $x_1, x_2, \ldots, x_n$,
we denote by $[\Phi]$ the function computed by $\Phi$ when the variables are taken as
arguments in lexicographical order. In the same way, given a finite set of boolean
functions $F$ we define [FOR($F$)] as the class of functions computed by formulas
in FOR($F$). Given a boolean circuit $C \in$ CIR($F$), it is possible to construct a
formula $\Phi \in$ FOR($F$) computing the same function as $C$ (the size of $\Phi$ can be
exponentially bigger than the size of $C$, but we are not concerned with the size
of the representation). Thus [FOR($F$)] = [CIR($F$)].

In fact, it is straightforward to verify that the class of functions [CIR($F$)] contains the
projections and is closed under composition. Therefore, it constitutes a clone.
More precisely, [CIR($F$)] is exactly the clone generated by $F$; in our terminology,

[CIR($F$)] = [FOR($F$)] = $\langle F \rangle$, for all bases $F$.

The use of clone theory to study computational problems over boolean formulas
was introduced in [23], which studies the complexity of some computational
problems over boolean formulas, such as satisfiability, tautology and some counting
problems.

We say that an $n$-ary function $f$ is disjunctive if there exist boolean
coefficients $c_i$ ($0 \le i \le n$) such that

$$f(x_1, \ldots, x_n) = c_0 \vee (c_1 \wedge x_1) \vee (c_2 \wedge x_2) \vee \cdots \vee (c_n \wedge x_n).$$

Similarly, we say that an $n$-ary function $f$ is conjunctive if there exist
boolean coefficients $c_i$ ($0 \le i \le n$) such that

$$f(x_1, \ldots, x_n) = c_0 \wedge (c_1 \vee x_1) \wedge (c_2 \vee x_2) \wedge \cdots \wedge (c_n \vee x_n).$$

Accordingly, we say that an $n$-ary function $f$ is linear if there exist
boolean coefficients $c_i$ ($0 \le i \le n$) such that

$$f(x_1, \ldots, x_n) = c_0 \oplus (c_1 \wedge x_1) \oplus (c_2 \wedge x_2) \oplus \cdots \oplus (c_n \wedge x_n).$$

Let F be a set of boolean functions. We will say that F is disjunctive (resp. 
conjunctive, linear) iff every function in F is disjunctive (resp. conjunctive, lin- 
ear). We will say that F is basic iff F is disjunctive, conjunctive or linear. 

For any set of boolean formulas (circuits) $B$, we define $C_B$ as the representation
of concepts formed from formulas (circuits) in $B$. More precisely, $C_B$ contains
all the tuples of the form $(C, x)$ where $C$ represents a formula (circuit) in $B$ and
$x$ is a model satisfying $C$.

In this section we state and prove the main result of the paper. 

Theorem 1. (Dichotomy Theorem for the Learnability of Boolean Circuits and
Boolean Formulas) Let $F$ be a finite set of boolean functions. If $F$ is basic, then
$C_{CIR(F)}$ is both polynomially exactly learnable with $n+1$ equivalence queries, and
polynomially exactly learnable with $n+1$ membership queries. Otherwise, $C_{FOR(F)}$
is not polynomially predictable with membership queries under the assumption
that any of the following three problems is intractable: testing quadratic residues
modulo a composite, inverting RSA encryption, or factoring Blum integers.

We refer the reader to Angluin and Kharitonov for definitions of the
cryptographic concepts. We mention that the non-learnability results for boolean
circuits also hold under the weaker assumption of the existence of public-key
encryption systems secure against CC-attack.



Proof of Theorem 1

Learnability of formulas over disjunctive, conjunctive and linear bases is
rather straightforward. Notice that if a basis $F$ is disjunctive (resp. conjunctive,
linear) then every function computed by a circuit in CIR($F$) is also disjunctive
(resp. conjunctive, linear). Thus, learning a boolean circuit over a basic basis
reduces to finding the boolean coefficients of the canonical expression. It is easy
to verify that this task can be done in polynomial time with $n + 1$ equivalence
queries or $n + 1$ membership queries, as stated in the theorem.
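The membership-query half of this claim can be sketched as follows for the linear and disjunctive cases: one query at the all-zero assignment recovers $c_0$, and one query per unit vector recovers each $c_i$. This is a minimal illustration under the assumption that the target really has the stated canonical form; the function names and the counting oracle are hypothetical helpers, not the paper's construction.

```python
def counting_oracle(f):
    """Wrap a target function as a membership oracle that records its queries."""
    calls = []
    def mq(x):
        calls.append(x)
        return f(x)
    return mq, calls

def unit(n, i):
    """Assignment with a single 1 at position i."""
    return tuple(1 if j == i else 0 for j in range(n))

def learn_linear(mq, n):
    """Coefficients [c_0, c_1, ..., c_n] of c_0 XOR (c_1 AND x_1) XOR ..."""
    c0 = mq((0,) * n)
    return [c0] + [mq(unit(n, i)) ^ c0 for i in range(n)]

def learn_disjunctive(mq, n):
    """Coefficients of c_0 OR (c_1 AND x_1) OR ...; if c_0 = 1 the target is constant 1."""
    c0 = mq((0,) * n)
    return [c0] + [mq(unit(n, i)) for i in range(n)]

def eval_linear(coeffs, x):
    value = coeffs[0]
    for c, xi in zip(coeffs[1:], x):
        value ^= c & xi
    return value

def eval_disjunctive(coeffs, x):
    value = coeffs[0]
    for c, xi in zip(coeffs[1:], x):
        value |= c & xi
    return value
```

The conjunctive case is symmetric: query the all-one assignment for $c_0$ and the assignments with a single 0 for the remaining coefficients.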

Now, we study which portion of Post's lattice is covered by these cases.
Consider the clone $L_1$ with generating set $L_1 = \langle \varphi_{O_4}, \varphi_{O_5}, \varphi_{O_6}, \varphi_{L_4} \rangle$. Since all the
operations in the generating set of $L_1$ are linear, the clone $L_1$ is linear.
Similarly, the clone $P_6$, generated by the operations $\varphi_{O_5}$, $\varphi_{O_6}$ and $\varphi_{P_1}$, is conjunctive,
since so is every operation in the generating set. Finally, the clone $S_6$, generated by
the operations $\varphi_{O_5}$, $\varphi_{O_6}$ and $\varphi_{S_1}$, is disjunctive, since so is every operation in the
generating set. Thus, every clone contained in $L_1$, $P_6$, or $S_6$ is linear, conjunctive
or disjunctive respectively. Actually, the clone $L_1$ (resp. $P_6$, $S_6$) has been chosen
carefully among the linear (resp. conjunctive, disjunctive) clones. It corresponds
to the maximal linear (resp. conjunctive, disjunctive) clone in Post's lattice.
It has been obtained as the join of all the linear (resp. conjunctive, disjunctive)
clones and, in consequence, has the property that every linear (resp. conjunctive,
disjunctive) clone is contained in $L_1$ (resp. $P_6$, $S_6$).

Let us study non-basic clones. By a simple inspection of Post's lattice
we can infer that any clone not contained in $L_1$, $S_6$ or $P_6$ contains at least one of the
following operations: $\varphi_{F_2^\infty}$, $\varphi_{F_6^\infty}$, $\varphi_{D_2}$. In Section 4.1 it is proved that if $\langle F \rangle$
contains any of the previous functions, then the class FOR($F$) is not PAC-predictable
even with membership queries under the assumptions of the theorem.



4.1 Three Fundamental Non-learnable Functions 

Let $B$ = {AND, OR, NOT} be the usual complete basis for boolean formulas.
It has been proved that $C_{FOR(B)}$ is not polynomially predictable under the
assumptions of Theorem 1. In this section we generalize the previous result to all
bases $F$ able to "simulate" any of the following three basic non-learnable functions:
$\varphi_{F_2^\infty}$, $\varphi_{F_6^\infty}$ and $\varphi_{D_2}$. These three functions can be regarded as the basic
causes for non-learnability in formulas.

The technique used to prove non-learnability results is a two-stage
pwm-reduction from $C_{FOR(B)}$. First, we prove as an intermediate result that monotone
boolean formulas are as hard to learn as general boolean formulas.

Lemma 3. The class $C_{FOR(B)}$ is pwm-reducible to $C_{FOR(\{AND, OR\})}$.

Proof. Let $\Phi$ be a boolean formula with $x_1, \ldots, x_n$ as input variables. We can
assume that NOT operations are applied only to input variables. Otherwise, by
repeatedly using De Morgan's laws we can move every NOT function towards the
input variables.

Consider the formula $\Psi$ with $x_1, \ldots, x_n, y_1, \ldots, y_n$ as input variables,
obtained by slightly modifying the formula $\Phi$ as follows: replace every NOT function
applied to a variable $x_i$ by the new input variable $y_i$. Finally, we define the formula
$\Gamma$ with $x_1, \ldots, x_n, y_1, \ldots, y_n$ as input variables to be:

$$\Gamma(x_1, \ldots, x_n, y_1, \ldots, y_n) = \Psi(x_1, \ldots, x_n, y_1, \ldots, y_n) \wedge \bigwedge_{1 \le i \le n} (x_i \vee y_i)$$

For every natural number $s$ and for every concept representation $u \in C_{FOR(B)}$
such that $|u| \le s$, let $\Phi$ be the boolean formula with $n$ input variables represented
by $u$, and let $u'$ be the representation of the monotone boolean formula $\Gamma$ obtained
from $\Phi$ as described above. We define $g(s, n, u) = u'$. For all assignments $x$
and $y$ of length $n$ we define $f(s, n, x) = x\bar{x}$, $h(s, n, xy) = x$ and
$j(s, n, xy, b) = $ [...].

Clearly, $f$, $g$, $h$ and $j$ satisfy the conditions (1), (2) and (3) in Definition 1
and therefore define a pwm-reduction.

Technical note: In the proof of this prediction with membership reduction and
in the next ones, the functions $f, g, h, j$ have been defined only partially to keep the
proof clear. It is trivial to extend them to obtain complete functions preserving
conditions (1), (2) and (3). ■

Now, we have to see that $C_{FOR(\{AND, OR\})}$ is pwm-reducible to $C_{FOR(F)}$ if $F$
is able to "generate" any of the following functions: $\varphi_{F_2^\infty}$, $\varphi_{F_6^\infty}$ or $\varphi_{D_2}$. Let us
do a case analysis.

Theorem 2. Let $F$ be a set of boolean functions. If $\varphi_{F_2^\infty} \in \langle F \rangle$ then
$C_{FOR(\{AND, OR\})}$ is pwm-reducible to $C_{FOR(F)}$.

Proof. The clone $\langle F \rangle$ includes the operation $\varphi_{F_2^\infty}(x, y, z) = x \vee (y \wedge z)$. Thus,
there exists some formula $\Phi_{F_2^\infty}$ in FOR($F$) over three variables $x, y, z$ such that
$[\Phi_{F_2^\infty}] = \varphi_{F_2^\infty}$.

Clearly, with the operation $\varphi_{F_2^\infty}$ and the additional help of the constant 0 it is
possible to simulate the functions AND and OR:

$$\mathrm{AND}(x, y) = \varphi_{F_2^\infty}(0, x, y)$$
$$\mathrm{OR}(x, y) = \varphi_{F_2^\infty}(x, y, y)$$

Let $\Psi$ be an arbitrary monotone boolean formula in FOR({AND, OR}) over
the input variables $x_1, x_2, \ldots, x_n$.

Let $\Gamma_2$ be the boolean formula over the input variables $x_1, \ldots, x_n, c_0$ obtained
from $\Psi$ by replacing every occurrence of AND($x, y$) by $\Phi_{F_2^\infty}(c_0, x, y)$ and, similarly,
every occurrence of OR($x, y$) by $\Phi_{F_2^\infty}(x, y, y)$.

Finally, let $\Gamma_1$ be the boolean formula defined by:

$$\Gamma_1(x_1, x_2, \ldots, x_n, c_0) = \Phi_{F_2^\infty}(\Gamma_2(x_1, x_2, \ldots, x_n, c_0), c_0, c_0)$$

By construction we have

$$\Gamma_2(x_1, x_2, \ldots, x_n, 0) = \Psi(x_1, x_2, \ldots, x_n)$$

and

$$\Gamma_1(x_1, x_2, \ldots, x_n, c_0) = \begin{cases} \Psi(x_1, x_2, \ldots, x_n) & \text{if } c_0 = 0 \\ 1 & \text{otherwise} \end{cases}$$

Now we are in a position to define the pwm-reduction. Let $g$ be the function
assigning to every monotone boolean formula $\Psi$ an associated formula $\Gamma_1$ in
FOR($F$) constructed as described above. Let $f$ be the function adding the value
of the constant zero to the end of the string, i.e.,

$$f(s, n, (x_1, x_2, \ldots, x_n)) = (x_1, x_2, \ldots, x_n, 0)$$

The mapping $h$ produces the inverse result. That is, given a string, it removes the last
value (corresponding to the constant 0):

$$h(s, n, (x_1, \ldots, x_n, c_0)) = (x_1, \ldots, x_n)$$

Finally, the function $j$ is defined by

$$j(s, n, (x_1, \ldots, x_n, c_0), b) = \begin{cases} b & \text{if } c_0 = 0 \\ 1 & \text{otherwise} \end{cases}$$

Thus, it is immediate to verify that $f$, $g$, $h$, and $j$ define a pwm-reduction from
$C_{FOR(\{AND, OR\})}$ to $C_{FOR(F)}$. ■
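The simulation used in this proof can be checked mechanically on a small example. The sketch below assumes a hypothetical tuple encoding of monotone formulas (`("var", i)`, `("and", a, b)`, `("or", a, b)`) and evaluates the translated formula directly through the operation $x \vee (y \wedge z)$ with the extra input $c_0$; it is an illustration of the construction, not the paper's encoding.

```python
from itertools import product

def phi(x, y, z):
    """The operation x OR (y AND z)."""
    return x | (y & z)

def eval_monotone(formula, xs):
    """Evaluate a monotone formula over AND/OR given as nested tuples."""
    if formula[0] == "var":
        return xs[formula[1]]
    a = eval_monotone(formula[1], xs)
    b = eval_monotone(formula[2], xs)
    return a & b if formula[0] == "and" else a | b

def eval_gamma2(formula, xs, c0):
    """Same formula with AND(x, y) -> phi(c0, x, y) and OR(x, y) -> phi(x, y, y)."""
    if formula[0] == "var":
        return xs[formula[1]]
    a = eval_gamma2(formula[1], xs, c0)
    b = eval_gamma2(formula[2], xs, c0)
    return phi(c0, a, b) if formula[0] == "and" else phi(a, b, b)

def eval_gamma1(formula, xs, c0):
    """Outer application phi(Gamma_2, c0, c0), which equals Gamma_2 OR c0."""
    return phi(eval_gamma2(formula, xs, c0), c0, c0)

# Example formula (x0 AND x1) OR (x2 AND (x0 OR x3)).
example = ("or",
           ("and", ("var", 0), ("var", 1)),
           ("and", ("var", 2), ("or", ("var", 0), ("var", 3))))
```

With $c_0 = 0$ the translated formula agrees with the original on every assignment, and with $c_0 = 1$ it is constantly 1, so a membership query on an assignment ending in 1 can be answered without consulting the original oracle.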

By duality we have:

Theorem 3. Let $F$ be a set of boolean functions. If $\varphi_{F_6^\infty} \in \langle F \rangle$ then
$C_{FOR(\{AND, OR\})}$ is pwm-reducible to $C_{FOR(F)}$.

Finally we study clones containing $\varphi_{D_2}$.

Theorem 4. Let $F$ be a set of boolean functions. If $\varphi_{D_2} \in \langle F \rangle$ then
$C_{FOR(\{AND, OR\})}$ is pwm-reducible to $C_{FOR(F)}$.

Proof. For this proof we will use the fact that the operation $\varphi_{D_2}$ satisfies the
self-duality property. Thus, every function in $\langle \varphi_{D_2} \rangle$ is self-dual. The clone $\langle F \rangle$ includes
the majority operation $\varphi_{D_2}(x, y, z) = (x \wedge y) \vee (x \wedge z) \vee (y \wedge z)$. Thus, there exists
some formula $\Phi_{D_2}$ in FOR($F$) over three variables $x, y, z$ such that the function
computed by $\Phi_{D_2}$ is $\varphi_{D_2}$.

Clearly, with the operation $\varphi_{D_2}$ and the additional help of constants it is
possible to simulate the functions AND and OR:

$$\mathrm{AND}(x, y) = \varphi_{D_2}(0, x, y)$$
$$\mathrm{OR}(x, y) = \varphi_{D_2}(1, x, y)$$

For every monotone boolean formula $\Psi$ in FOR({AND, OR}) over the input
variables $x_1, x_2, \ldots, x_n$, we construct an associated formula $\Gamma_1$ over the variables
$x_1, x_2, \ldots, x_n, c_0, c_1$ defined by

$$\Gamma_1(x_1, x_2, \ldots, x_n, c_0, c_1) = \Phi_{D_2}(\Gamma_2(x_1, x_2, \ldots, x_n, c_0, c_1), c_0, c_1),$$

where $\Gamma_2$ is the formula over the input variables $x_1, x_2, \ldots, x_n, c_0, c_1$ obtained
from $\Psi$ in a similar way to the previous proof: we replace every occurrence of
AND($x, y$) by $\Phi_{D_2}(c_0, x, y)$ and every occurrence of OR($x, y$) by $\Phi_{D_2}(c_1, x, y)$.

By construction we have (the case $c_0 = 1 \wedge c_1 = 0$ is a consequence of the
self-duality of $\varphi_{D_2}$),

$$\Gamma_2(x_1, x_2, \ldots, x_n, c_0, c_1) = \begin{cases} \Psi(x_1, x_2, \ldots, x_n) & \text{if } c_0 = 0 \wedge c_1 = 1 \\ \neg\Psi(\neg x_1, \neg x_2, \ldots, \neg x_n) & \text{if } c_0 = 1 \wedge c_1 = 0 \\ \text{undetermined} & \text{otherwise} \end{cases}$$

$$\Gamma_1(x_1, x_2, \ldots, x_n, c_0, c_1) = \begin{cases} 0 & \text{if } c_0 = c_1 = 0 \\ 1 & \text{if } c_0 = c_1 = 1 \\ \Psi(x_1, x_2, \ldots, x_n) & \text{if } c_0 = 0 \wedge c_1 = 1 \\ \neg\Psi(\neg x_1, \neg x_2, \ldots, \neg x_n) & \text{if } c_0 = 1 \wedge c_1 = 0 \end{cases}$$

Now we are in a position to define the pwm-reduction. Let $g$ be the function
assigning to every monotone boolean formula $\Psi$ in FOR({AND, OR}) an
associated formula $\Gamma_1$ in FOR($F$) constructed as described above. Let $f$ be the
function adding the values of the constants to the end of the string, i.e.,

$$f(s, n, (x_1, x_2, \ldots, x_n)) = (x_1, x_2, \ldots, x_n, 0, 1).$$

The mapping $h$ removes the last two values (corresponding to the constants) and,
moreover, $h$ negates the values of the assignment if the values of the constants
are flipped:

$$h(s, n, (x_1, \ldots, x_n, c_0, c_1)) = \begin{cases} (x_1, \ldots, x_n) & \text{if } c_0 = 0 \vee c_1 = 1 \\ (\neg x_1, \ldots, \neg x_n) & \text{otherwise} \end{cases}$$

Finally, the function $j$ is defined by

$$j(s, n, (x_1, \ldots, x_n, c_0, c_1), b) = \begin{cases} 0 & \text{if } c_0 = 0 \wedge c_1 = 0 \\ b & \text{if } c_0 = 0 \wedge c_1 = 1 \\ \neg b & \text{if } c_0 = 1 \wedge c_1 = 0 \\ 1 & \text{if } c_0 = 1 \wedge c_1 = 1 \end{cases}$$

Thus, it is immediate to verify that $f$, $g$, $h$, and $j$ define a pwm-reduction from
$C_{FOR(\{AND, OR\})}$ to $C_{FOR(F)}$. ■
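The four cases for the constants $c_0, c_1$ in this proof can be verified by brute force on a small monotone formula. As before, the nested-tuple encoding of formulas is a hypothetical convenience for this sketch only.

```python
from itertools import product

def maj(x, y, z):
    """The majority operation (x AND y) OR (x AND z) OR (y AND z)."""
    return (x & y) | (x & z) | (y & z)

def eval_monotone(formula, xs):
    """Evaluate a monotone formula over AND/OR given as nested tuples."""
    if formula[0] == "var":
        return xs[formula[1]]
    a = eval_monotone(formula[1], xs)
    b = eval_monotone(formula[2], xs)
    return a & b if formula[0] == "and" else a | b

def eval_gamma2(formula, xs, c0, c1):
    """AND(x, y) -> maj(c0, x, y) and OR(x, y) -> maj(c1, x, y)."""
    if formula[0] == "var":
        return xs[formula[1]]
    a = eval_gamma2(formula[1], xs, c0, c1)
    b = eval_gamma2(formula[2], xs, c0, c1)
    return maj(c0, a, b) if formula[0] == "and" else maj(c1, a, b)

def eval_gamma1(formula, xs, c0, c1):
    return maj(eval_gamma2(formula, xs, c0, c1), c0, c1)

def negate(xs):
    return tuple(1 - x for x in xs)

# Example formula (x0 OR x1) AND (x2 OR (x0 AND x3)).
example = ("and",
           ("or", ("var", 0), ("var", 1)),
           ("or", ("var", 2), ("and", ("var", 0), ("var", 3))))
```

When the constants are flipped, AND and OR trade places, so the translated formula computes the dual of the original; this is why $h$ negates the assignment and $j$ negates the answer in that case.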

Abstract. A new research frontier in AI and data mining seeks to develop
methods to automatically discover relevant variables among many
irrelevant ones. In this paper, we present four algorithms that output
such crucial variables in the PAC model with membership queries. The first
algorithm executes the task under any unknown distribution by measuring
the distance between virtual and real targets. The second algorithm
exhausts the virtual version space under an arbitrary distribution. The third
algorithm exhausts a universal set under the uniform distribution. The
fourth algorithm measures the influence of variables under the uniform
distribution. Knowing the number r of relevant variables, the first algorithm
runs in time almost linear in r. The second and the third ones use fewer
membership queries than the first one, but run in time exponential in
r. The fourth one enumerates highly influential variables in time
quadratic in r.

1 Introduction: Terminology and Strategy 

We propose several algorithms, each with its own character, for automatically finding
relevant variables in the presence of many irrelevant ones. Recent applications of
such algorithms range from data mining in genome analysis to information
filtering in network computing. In these applications, sample data consist
of a huge number of variables, although the target phenomenon may depend on
only a few of them. In order for a machine to find such crucial variables, we study
algorithms in the PAC model with membership queries, and analyze their query and
time complexities.

To learn an unknown Boolean function that depends on only a small number
of the potential variables, the learner may (1) find a set of relevant variables and
(2) combine them into a hypothesis that approximates the target concept with
an arbitrarily high accuracy. Most inductive learning algorithms do not separate
these conceptually different tasks, but rather let them depend on each other.
On the other hand, some AI learning papers separate them and execute (1) as a
preprocess for (2) (see [H] for a survey). This paper pursues (1) as the learning
goal.

Let us begin by fixing the notion of relevant variables. We say that two Boolean
instances are neighbors of each other if they have the same bit length and differ
in exactly one bit.

Definition 1. A variable $x$ is relevant to a function $f$ if there exist neighbor
instances $A$ and $A'$ such that $x(A) \ne x(A')$ and $f(A) \ne f(A')$.

Let Rel($r, n$) be the class of Boolean functions with $n$ variables that depend
on only $r$ of them, or equivalently, have at most $r$ relevant variables. In this
paper, the target concept is an arbitrary function in Rel($r, n$). In other words,
the learner is given the $n$ Boolean variables, say $x_1, \ldots, x_n$, and told that there
are at most $r$ relevant variables to the target concept.¹

In this paper we work on Probably Approximately Correct (PAC) Learning
introduced by Valiant, where the target concept $f$ and the target distribution
$\mathcal{D}$ are postulated and hidden from the learner. The learner receives a sequence of
examples $(A, f(A)), (B, f(B)), \ldots$ drawn independently and randomly from $\mathcal{D}$. From
these examples, the learner must build a hypothesis that can approximate the
target concept to an arbitrary accuracy with respect to $\mathcal{D}$.

This paper sets a weaker goal for learning. For a Boolean function $h$ let rel($h$)
denote the set of relevant variables to $h$.

Definition 2. A set $V$ of variables is called an $\alpha$-dominator, $0 < \alpha < 1$, if there
exists a function $h$ such that rel($h$) $\subseteq V$ and $\mathrm{Prob}_{\mathcal{D}}\{h(A) = f(A)\} > \alpha$, where
$A$ is randomly chosen according to $\mathcal{D}$.

A weaker goal of learning is to find an $(1 - \epsilon)$-dominator. Our algorithms
may use inner coin flips and achieve this weaker goal with probability $> 1 - \delta$
for arbitrary given constants $0 < \epsilon, \delta < 1$.

Unfortunately, random examples are not enough as an information resource for
identifying relevant variables of a given Boolean concept, because:

- There can be computationally hard problems even though we can
information-theoretically identify the function. For instance, consider the class of
conjunctions of a ($\log n$)-bit majority and a ($\log n$)-bit parity. This class (contained
in Rel($2 \log n$, $n$)) is believed to be hard to predict within polynomial time even
under the uniform distribution. The problem there is that we may have
enough information to identify the function (using a universal set) but it is
computationally hard to discover the set of relevant variables.

- Moreover, due to a reduction from set cover, it is NP-hard in general to discover
a set of relevant variables on which one can build an accurate hypothesis.

¹ We can tune our algorithms up into versions adaptive in $r$ by guessing and doubling
$r$, that is, by executing the algorithms with guesses $r = 2^0, 2^1, 2^2, \ldots$ until they succeed.

We will thus allow algorithms to use more active queries in addition to
random examples, namely the membership queries introduced by Angluin.

Definition 3. A membership query for an instance $A$ about the target concept
$f$ is a request for the value $MQ_f(A) = f(A)$ to the membership oracle $MQ_f$.

In empirical testing, reverse engineering or information search, however,
membership queries are much more expensive than random examples. We will thus
present economical algorithms in Section 3 that spend at most $r \log n$
membership queries.

The algorithms in this paper follow a common learning strategy: find
witnesses for relevance and execute binary search on them. The algorithms differ in
their methods for discovering witnesses for relevance.

Definition 4. For an instance A and a set V of variables Ay and A'^ are 
instances such that 



Definition 5. A witness for relevance outside of $V$ about the target concept $f$
is a pair $(A, B)$ of instances $A$ and $B$ such that $A_V = B_V$ and $f(A) \ne f(B)$.

Given a witness $(A, B)$ for relevance, binary search finds a relevant variable
as follows. It flips half of the bits that differ between $A$ and $B$, and asks the
membership query for the obtained instance $C$. If $f(A) \ne f(C)$ then it repeats the
argument on the new witness $(A, C)$, otherwise on $(B, C)$. Either case reduces the
number of differing bits in the witness by half. This divide-and-query argument
repeats until reaching a variable $x$ such that $x(A) \ne x(B)$.

Lemma 1. Given an arbitrary witness for relevance outside of $V$, the binary
search outputs a relevant variable $x \notin V$ by using at most $\log n$ membership
queries in $O(n \log n)$ time.

The proof is folklore (see e.g. [3 Lemma 2.4]). Note that $\log n$ is a query
complexity lower bound for finding one relevant variable from a given witness for
relevance, hence $r \log n$ is a lower bound for finding $r$ relevant variables.
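The divide-and-query step can be sketched as follows. This is a minimal illustration, assuming instances are tuples of 0/1 values, that `mq` is a membership oracle for the target, and that the value $f(A)$ is already known from the example; the function name and signature are hypothetical.

```python
def binary_search_relevant(mq, a, fa, b):
    """Given a witness (a, b) with f(a) = fa != f(b), return the index of a
    relevant variable on which a and b differ.

    Invariant: a and b differ exactly on the positions in `diff`,
    f(a) equals fa, and f(b) differs from fa.
    """
    a, b = list(a), list(b)
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    while len(diff) > 1:
        first, rest = diff[: len(diff) // 2], diff[len(diff) // 2:]
        c = list(a)
        for i in first:          # move half of the differing bits from a towards b
            c[i] = b[i]
        if mq(tuple(c)) != fa:   # (a, c) is a witness, differing only on `first`
            b, diff = c, first
        else:                    # (c, b) is a witness, differing only on `rest`
            a, diff = c, rest
    return diff[0]
```

Each iteration spends one membership query and halves the number of differing positions, so at most $\lceil \log_2 n \rceil$ queries are used; the two final instances are neighbors with different labels, so the returned variable is relevant.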

2 Measuring the Distance between Virtual and Real Targets

This section provides a distribution-free algorithm that runs in time almost linear
in $r$. It measures the distance between the virtual and real targets, and if the distance
is large then the algorithm discovers a new relevant variable.

Definition 6. The virtual target $f_V$ on a set $V$ of variables about the target
concept $f$ is a Boolean function $f_V$ with rel($f_V$) $\subseteq V$ such that $f_V(A_V) = f(A_V)$
holds for any instance $A$.

In the mistake bound model with membership queries, Blum, Hellerstein and
Littlestone measure $\mathrm{Prob}_{\mathcal{D}}\{f(A) \ne h_V(A)\}$ for a temporary hypothesis $h$ to
find a new relevant variable that is not yet implemented in $h$. Here we measure
$\mathrm{Prob}_{\mathcal{D}}\{f(A) \ne f_V(A)\}$, the distance between $f$ and $f_V$. If it is large, then our
algorithm finds a witness for relevance $(A, A_V)$ with high probability.

Lemma 2. Let $m$ be any integer with $(1 - \epsilon)^m < \delta$. Draw a sample $O$ of $m$
examples. If $V$ is not an $(1 - \epsilon)$-dominator then $f(A) \ne f(A_V)$ happens for
some $(A, f(A)) \in O$ with probability greater than $1 - \delta$.

Proof. Suppose that $V$ is not an $(1 - \epsilon)$-dominator. Then the distance between
$f$ and $f_V$ is greater than $\epsilon$, hence $f(A) = f(A_V)$ happens for each $(A, f(A)) \in O$
with probability $< (1 - \epsilon)^m < \delta$.

Now we implement this lemma in an algorithm MsrDist and analyze its
performance.

Procedure MsrDist
Input: Integers n, r and m_0.
Output: A set of relevant variables.
  Set V := ∅ and m := 0.
  Do until either |V| = r or m = m_0
    Draw a random example (A, f(A)) and increment m by 1.
    Let β := MQ_f(A_V).
    If f(A) ≠ β then Do
      Set m := 0.
      Execute the binary search on (A, A_V) and put an obtained variable into V.
    EndDo
  EndDo
  Output V.
EndProc
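A runnable sketch of the procedure is given below. It rests on assumptions that are not confirmed by the text: $A_V$ is taken to be the instance that agrees with $A$ on $V$ and is 0 elsewhere, `draw_example` is a hypothetical callable returning a pair $(A, f(A))$, and `sample_bound` picks the smallest $m_0$ with $(1 - \epsilon)^{m_0} \le \delta / r$.

```python
import math

def sample_bound(eps, delta, r):
    """Smallest integer m_0 with (1 - eps)^m_0 <= delta / r."""
    m = math.ceil(math.log(delta / r) / math.log(1 - eps))
    while (1 - eps) ** m > delta / r:   # guard against floating-point rounding
        m += 1
    return max(m, 1)

def project(a, v):
    """Assumed reading of A_V: keep the bits of A on V, set the others to 0."""
    return tuple(a[i] if i in v else 0 for i in range(len(a)))

def find_relevant(mq, a, fa, b):
    """Binary search on a witness (a, b) with f(a) = fa != f(b)."""
    a, b = list(a), list(b)
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    while len(diff) > 1:
        first, rest = diff[: len(diff) // 2], diff[len(diff) // 2:]
        c = list(a)
        for i in first:
            c[i] = b[i]
        if mq(tuple(c)) != fa:
            b, diff = c, first
        else:
            a, diff = c, rest
    return diff[0]

def msr_dist(draw_example, mq, n, r, m0):
    """Collect relevant variables until r are found or m0 draws in a row fail."""
    v, m = set(), 0
    while len(v) < r and m < m0:
        a, fa = draw_example()
        m += 1
        a_v = project(a, v)
        if mq(a_v) != fa:               # (A, A_V) is a witness outside of V
            m = 0
            v.add(find_relevant(mq, a, fa, a_v))
    return v
```

The counter `m` is reset whenever a new variable is found, so each stage gets a fresh budget of `m0` examples, mirroring the stage-wise argument in the analysis.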

Theorem 1. MsrDist finds an $(1 - \epsilon)$-dominator (a set of relevant variables)
with probability greater than $1 - \delta$. For some $m_0 = O(\epsilon^{-1} \log(r/\delta))$, MsrDist
uses at most $m_0 r$ random examples, at most $r(m_0 + \log n)$ membership queries
and runs in $O(nr(m_0 + \log n))$ time.

Proof. Fix an arbitrary $m_0 = O(\epsilon^{-1} \log(r/\delta))$ such that $(1 - \epsilon)^{m_0} < \delta/r$. Then,
in view of Lemma 2, in each stage of the Do loop in the procedure MsrDist where
$V$ is not an $(1 - \epsilon)$-dominator, each example $(A, f(A))$ yields a witness $(A, A_V)$
for relevance outside of $V$ with probability greater than $\epsilon$. Since MsrDist can
draw $m_0$ examples in the stage, it will succeed in finding a witness with probability
greater than $1 - (1 - \epsilon)^{m_0} > 1 - \delta/r$. Since there are at most $r$ stages, due to
the union bound, every stage will succeed with probability $> 1 - (\delta/r) \cdot r = 1 - \delta$,
so with this probability MsrDist outputs an $(1 - \epsilon)$-dominator.



3 Exhausting Virtual Version Spaces 

This section presents two algorithms that use fewer membership queries than the one
in the previous section. One algorithm is distribution-free and exhausts virtual
version spaces, while the other works under the uniform distribution and exhausts
an $r$-universal set.

Intuitively, a version space is the set of hypotheses under consideration in
inductive learning. Haussler proposed to exhaust the version space by throwing
away the hypotheses that are inconsistent with the drawn examples until only
accurate hypotheses may remain.

Definition 7. For a set $V$ of variables, the virtual version space on $V$ is the set
of Boolean functions $h$ with rel($h$) $\subseteq V$.

Let $V$ be the set of already found relevant variables. Then the virtual version
space may expand each time a new relevant variable is discovered
and added to $V$. Unless $V$ is an $(1 - \epsilon)$-dominator, exhausting the virtual version
space on $V$ is shown to provide a witness for relevance outside of $V$ with high
probability.



Lemma 3. Let $m$ be any integer such that $(1 - \epsilon)^m 2^{2^{|V|}} < \delta$. Draw a sample $O$ of
$m$ random examples. If $V$ is not an $(1 - \epsilon)$-dominator then there exists a witness
$(A, B)$ for relevance outside of $V$ with $(A, f(A)), (B, f(B)) \in O$ with probability
at least $1 - \delta$.

Proof. Suppose that $V$ is not an $(1 - \epsilon)$-dominator. Then for every function $h$
in the virtual version space on $V$, $h(A) \ne f(A)$ holds with probability greater
than $\epsilon$, so $O$ is consistent with $h$ with probability $< (1 - \epsilon)^m$. The union bound
thus implies that $O$ does not provide any witness for relevance outside of $V$ with
probability $< (1 - \epsilon)^m 2^{2^{|V|}} < \delta$.

We now implement this lemma in an algorithm ExhVVS that finds an
$(1 - \epsilon)$-dominator. For each new example $(A, f(A))$, ExhVVS checks, over the old
examples $(B, f(B))$ in a stock, whether $(A, B)$ is a witness for relevance
outside of $V$. If ExhVVS finds a new witness it discards all the old examples and
makes the stock empty.

Procedure ExhVVS
Input: Integers n and r.
Output: A set of relevant variables.
  Set O := ∅, V := ∅, s := 1 and m := 0.
  Initialize m_0 := the minimum integer such that $(1 - \epsilon)^{m_0} 2^{2^s} \le \delta/r$.
  Do until either |V| = r or m = m_0
    Draw a random example (A, f(A)).
    For each B ∈ O Do
      If A_V = B_V and f(A) ≠ f(B) then Do
        Update O := ∅, m := 0 and s := s + 1.
        Update m_0 := the minimum integer such that $(1 - \epsilon)^{m_0} 2^{2^s} \le \delta/r$.
        Execute the binary search on (A, B) and put an obtained variable into V.
      EndDo
    EndDo
    Put (A, f(A)) into O and increment m by 1.
  EndDo
  Output V.
EndProc

Theorem 2. The procedure ExhVVS outputs an $(1 - \epsilon)$-dominator (a set of
relevant variables) with probability greater than $1 - \delta$. ExhVVS uses at most
$m_0 = O(r 2^r \log(1/\epsilon) \log(r/\delta))$ random examples and $r \log n$ membership queries, and runs
in $O(n(m_0 + r \log n))$ time.

Proof. If $|V| = s$ then ExhVVS draws $m_0$ random examples so that
$(1 - \epsilon)^{m_0} 2^{2^s} \le \delta/r$. Therefore, due to Lemma 3, if $V$ is not an $(1 - \epsilon)$-dominator then it
derives a witness for relevance outside of $V$ with probability at least $1 - \delta/r$. The
remaining argument is the same as in the proof of Theorem 1.

Under the uniform distribution, a modification of ExhVVS can find all the
relevant variables by exhausting an $r$-universal set.

Definition 8. A set $U$ of instances is called $r$-universal if every $r$-bits on every
set of $r$ variables occurs in some instance $A$ in $U$.

Damaschke [Z] worked in the exact learning model with only membership 
queries and studied the numbers of (adaptive and non-adaptive) membership 
queries for exhausting r-universal set. Here we show that, under the uniform 
distribution, an enough number of random examples provides an r-universal set 
and that any of those sets is a sufficient source for witnesses for all relevant 
variables. 

Lemma 4. Let m be any integer such that $(1-2^{-r})^m 2^r \binom{n}{r} < \delta$. Draw m
instances independently and uniformly at random from the n-bit
instance space. Then they form an r-universal set with probability at least $1-\delta$.

Proof. We say that the sample hits an r-bits on a set of r variables if the
assignment A of some example (A, f(A)) in the sample sets those variables to those
bits. By the independence of the examples, the sample fails to
hit a given r-bits with probability $(1-2^{-r})^m$. Since there are $2^r \binom{n}{r}$ possible
r-bits, the union bound implies that the sample fails to hit some r-bits
with probability at most $(1-2^{-r})^m 2^r \binom{n}{r} < \delta$. Equivalently, the sample hits
every r-bits with probability at least $1-\delta$.
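
The condition of Lemma 4 is easy to evaluate numerically. The sketch below is an illustration, not part of the original procedure: it computes the union bound and the smallest sample size m that satisfies the condition, assuming r is at least 1. The function names `failure_bound` and `sample_size` are hypothetical.

```python
import math


def failure_bound(m, n, r):
    """Union bound (1 - 2^-r)^m * 2^r * C(n, r) from Lemma 4."""
    return (1 - 2.0 ** -r) ** m * 2 ** r * math.comb(n, r)


def sample_size(n, r, delta):
    """Smallest m whose failure bound drops below delta (requires r >= 1)."""
    patterns = 2 ** r * math.comb(n, r)
    # Closed-form estimate obtained by solving the inequality for m.
    estimate = math.log(patterns / delta) / -math.log(1 - 2.0 ** -r)
    m = max(0, math.ceil(estimate))
    # Correct for floating-point rounding in either direction.
    while failure_bound(m, n, r) >= delta:
        m += 1
    while m > 0 and failure_bound(m - 1, n, r) < delta:
        m -= 1
    return m
```

The estimate grows like $2^r (r \log n + \log(1/\delta))$, which is consistent with the exponential dependence on r in the bounds of this section.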

Let ExhUniv be a modification of ExhVVS that does not discard old examples 
at all; ExhUniv omits updating O and mg in the lines 9 and 10 of the procedure 
ExhVVS. 

Theorem 3. Under the uniform distribution, ExhUniv finds all the relevant
variables with probability at least $1-\delta$ using at most $m_0 = O(r 2^r \log n \log(1/\delta))$
random examples and $r \log n$ membership queries, in $O(n(m_0 + r \log n))$ time.

Proof. Let R be the set of relevant variables and let $s = |R| \le r$. Let $m_0$ satisfy
$(1-2^{-s})^{m_0} 2^s \binom{n}{s} < \delta$. Then, due to Lemma 4, the sample of size $m_0$ forms
an s-universal set with probability at least $1-\delta$. Such a sample induces every s-bits
on R, so in particular ExhUniv discovers a witness (A, B) for relevance of every
variable $x \in R$ such that $A_R$ and $B_R$ are neighbors at x, and hence finds x itself by
binary search on (A, B).

4 Measuring Influence of Variables 

An algorithm in this section measures the influence of variables on the target
concept under the uniform distribution and outputs the highly influential ones.
Such an approach to learning has been taken in many AI and data mining research
papers and achieves good empirical success (see [SI Section 2.4]). Based on this
approach, we will design an algorithm that finds a $(1-\epsilon)$-dominator under
the uniform distribution in time almost quadratic in r, the number of relevant
variables.

Definition 9. The influence Inf(x) of a variable x is the fraction of all
instances A such that $f(A) \ne f(A')$ for the neighbor
instance A' of A at x.
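
For small n the influence of Definition 9 can be computed by exhaustive enumeration. This is a minimal sketch under the assumption that f takes a tuple of 0/1 bits and that variables are indexed from 0; the example function `target` is hypothetical.

```python
from itertools import product


def influence(f, n, i):
    """Fraction of instances A with f(A) != f(A'), where A' flips bit i of A."""
    count = 0
    for a in product((0, 1), repeat=n):
        flipped = a[:i] + (1 - a[i],) + a[i + 1:]
        if f(a) != f(flipped):
            count += 1
    return count / 2 ** n


def target(a):
    """Hypothetical target: x0 XOR (x1 AND x2); x3 is irrelevant."""
    return a[0] ^ (a[1] & a[2])
```

Here x0 has influence 1, x1 and x2 have influence 1/2 each (flipping one matters only when the other equals 1), and the irrelevant x3 has influence 0.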

Therefore, Inf(x) = 0 if and only if the variable x is irrelevant, and Inf(x) = 1
if and only if $f = x \oplus g$ for some function g irrelevant with respect to x. To show
existence of highly influential variables, Ben-Or and Linial 0 applied the edge- 
isoperimetric inequality (see jSl Section 16] for edge-isoperimetric inequalities). 

Lemma 5 (Edge Isoperimetric Inequality). For a given Boolean function
$f(x_1, \ldots, x_n)$ choose $b \in \{0,1\}$ and $k \ge 1$ such that

$2^{-k-1} < \mathrm{Prob}_A\{f(A) = b\} \le 2^{-k}$

under the uniform distribution on A. We then have . 
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In order to implement this inequality in a learning algorithm, we need to 
relatize it on a given set V of variables. 

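Lemma 6. If V is not a $(1-\epsilon)$-dominator then $\sum_{x \notin V} \mathrm{Inf}(x) \ge 2\epsilon$.

Proof. Let h be a Bayes optimal predictor of f on V. That is, (1) $\mathrm{rel}(h) \subseteq V$
and (2) h(A) = 1 if and only if $\mathrm{Prob}_B\{f(B) = 1 \mid B_V = A_V\} > 1/2$. We suppose
$\sum_{x \notin V} \mathrm{Inf}(x) < 2\epsilon$ and prove that h approximates f with accuracy $> 1-\epsilon$.

For any instance A we let

$p(A) = 1 - \max\{\mathrm{Prob}_B\{f(B) = 1 \mid B_V = A_V\}, \mathrm{Prob}_B\{f(B) = 0 \mid B_V = A_V\}\}$.

Then the expectation of p(A) over A is $\mathbf{E}_A[p(A)] = \mathrm{Prob}_A\{h(A) \ne f(A)\}$, so it is
enough to claim that $\mathbf{E}_A[p(A)] < \epsilon$.

For any instance A let $f_{A,V}$ be the function that fixes the input bits of f
on V by A. Hence $\mathrm{rel}(f_{A,V}) \subseteq \mathrm{rel}(f) - V$. Lemma 5 then promises
$2p(A) \le \sum_{x \notin V} \mathrm{Inf}_{f_{A,V}}(x)$, so taking the average over A on both sides yields

$2\mathbf{E}_A[p(A)] \le \sum_{x \notin V} \mathrm{Inf}_f(x) < 2\epsilon$,

so we obtain $\mathbf{E}_A[p(A)] < \epsilon$.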

Uehara et. al. m apply the following lemma for finding relevant variables 
to fundamental Boolean functions (conjunctions, parities, etc). 

Lemma 7 (Uehara et al.). Let R be any set of r elements in the n
element set X. Choose a subset W of X with |W| = n/r uniformly at random.
Then $|R \cap W| = 1$ happens with probability > 1/e.
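
The probability in Lemma 7 has a closed hypergeometric form, which gives a quick sanity check of the 1/e bound. The sketch below assumes that r divides n (so that n/r is an integer); the function name `prob_single_hit` is hypothetical.

```python
import math


def prob_single_hit(n, r):
    """Exact probability that a uniform subset W of size n/r meets R in one element.

    Choose the single element of R (r ways), then the remaining |W| - 1
    elements from the n - r elements outside R.
    """
    w = n // r  # assumes r divides n
    return r * math.comb(n - r, w - 1) / math.comb(n, w)
```

For example, with n = 100 and r = 10 the probability is about 0.41, comfortably above 1/e, which is roughly 0.37.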

Now, we present an algorithm that applies Lemma 6 and Lemma 7 and finds
highly influential variables.

Procedure Msrinf
Input: Integers n, r and $m_0$.

Output: A set of relevant variables.

Set $X := \{x_1, \ldots, x_n\}$, $V := \emptyset$, m := 0, n' := n and r' := r.

Do until either r' = 0 or m = $m_0$
    Draw a random example (A, f(A)) and increment m by 1.

    Choose a set of variables $W \subseteq X - V$ with |W| = n'/r' uniformly at random.
    Let $\beta := \mathrm{MQ}_f(A^W)$.
    If $f(A) \ne \beta$ then Do

        Update m := 0, n' := n' - 1 and r' := r' - 1.

        Execute the binary search on $(A, A^W)$, put the obtained variable in V and
        remove the variable from X.

    EndDo
EndDo
Output V.

EndProc

Theorem 4. Under the uniform distribution, Msrinf finds a $(1-\epsilon)$-dominator
(a set of relevant variables) with probability greater than $1-\delta$ using at most $m_0
= O((r/\epsilon) \log(r/\delta))$ random examples and $r(m_0 + \log n)$ membership queries, in
$O(nr(m_0 + \log n))$ time.
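
The procedure Msrinf can be sketched in code as follows. This is an illustration under stated assumptions rather than the authors' implementation: the target f serves as both the example oracle and the membership oracle, $A^W$ is read as A with the bits in W flipped, |W| = n'/r' is rounded down to an integer (at least 1), and r is at most n. The helper names `flip` and `binary_search` are hypothetical.

```python
import random


def flip(a, positions):
    """Return a copy of instance a with the given bit positions flipped."""
    return tuple(1 - bit if i in positions else bit for i, bit in enumerate(a))


def binary_search(f, a, diff):
    """Given f(a) != f(flip(a, diff)), return one relevant variable in diff."""
    diff = list(diff)
    while len(diff) > 1:
        half = diff[: len(diff) // 2]
        c = flip(a, set(half))
        if f(c) != f(a):
            diff = half  # the pair (a, c) is a witness differing only on half
        else:
            # f(c) == f(a), so c and the far endpoint still disagree.
            a, diff = c, diff[len(diff) // 2:]
    return diff[0]


def msrinf(f, n, r, m0, rng):
    """Sketch of Msrinf: returns a set of relevant variable indices."""
    remaining = list(range(n))  # X
    found = set()               # V
    m, n_left, r_left = 0, n, r
    while r_left > 0 and m < m0:
        a = tuple(rng.randint(0, 1) for _ in range(n))  # uniform random example
        m += 1
        size = max(1, n_left // r_left)  # |W| = n'/r', rounded down (assumption)
        w = rng.sample(remaining, size)
        if f(a) != f(flip(a, set(w))):   # membership query on A^W
            m, n_left, r_left = 0, n_left - 1, r_left - 1
            x = binary_search(f, a, w)
            found.add(x)
            remaining.remove(x)
    return found
```

Every variable returned by `binary_search` is relevant, because the search ends with a pair of neighbor instances on which f disagrees; whether all influential variables are found depends on $m_0$, as Theorem 4 states.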

Proof. Fix an arbitrary toq = 0(e“^ log(r/i5)) such that (1 — < 5/r. Let 

R be any set of r variables containing all the relevant variables. 

In view of Lemma 0 choosing W as in the procedure, we have |(i? — V) n W\ 
= 1 with probability greater than 1/e. Moreover, if it is so, {R — V)UW = {a;} 
happens equally likely for each x € R — V. Therefore, drawing A according to T> 
and choosing W as in the procedure, f(A) f{A^) happens with probability 
greater than ^ T^^us if V is not an (1 — e)-dominator then 

Lemma 0 promises ^ {A,A^) happens to be a witness for 

relevance outside of V with probability greater than — . 

Therefore, in each stage of the Do loop, if V is not an (1 — e)-dominator 
then a witness for relevance is successfully found with probability greater than 
1 — (1 — > 1 — S/r. The remaining argument is the same with the proof 

of Theorem 0 

The procedure Msrinf enumerates variables in order of their influence on the
target. For example, suppose that there are 100 relevant variables where only
three of them have influence 0.1 and the other 97 have only 0.001. Msrinf finds
the three ahead of the other 97 with probability > $(0.9)^3 = 0.729$.

Abstract. In this paper we introduce a general method that allows one to
prove tight linear inequalities between different types of predictive
complexity and thus we generalise our previous results. The method relies
upon probabilistic considerations and allows one to describe (using
geometrical terms) the sets of coefficients which correspond to true inequalities.
We also apply this method to the square-loss and logarithmic complexity
and describe their relations which were not covered by our previous
research.



1 Introduction 

This paper generalises the author’s paper 0. In 0 we proved tight inequalities 
between the square-loss and logarithmic complexities. The key point of that pa- 
per is a lower estimate on the logarithmic complexity, which follows from the 
coincidence of the logarithmic complexity and a variant of Kolmogorov complex- 
ity. That estimate could not be extended for other types of predictive complexity. 
The main theorem (Theorem 7) of ^ pointed out the “accidental” coincidence 
of two real-valued functions but P) did not explain the deeper reasons beyond 
this fact. 

In this paper we use a different approach to lower estimates of predictive 
complexity. It relies upon probabilistic considerations and may be applied to 
a broader class of games. The results of 0 become a particular case of more 
general statements and receive a more profound explanation. 

As an application of the general method we establish linear relations between 
the square-loss and logarithmic complexity we have not covered in 

When giving the motivations for considering predictive complexity, we will 
briefly repeat some points from 

We work within an on-line learning model. In this model, a learning algorithm
makes a prediction of a future event, then observes the actual event, and suffers
loss due to the discrepancy between the prediction and the actual outcome.
The total loss suffered by an algorithm over a sequence of several events can be
regarded as the complexity of this sequence with respect to this algorithm. An
plexity: recursion-theoretic variants”) and by ORS Awards Scheme.



O. Watanabe, T. Yokomori (Eds.): ALT’99, LNAI 1720, pp. 323-|2SH 1999- 
(c) Springer- Verlag Berlin Heidelberg 1999 






Yuri Kalnishkan 



optimal or universal measure of complexity cannot be defined within the class 
of losses of algorithms, so we need to consider a broader class of “complexities”, 
namely, the class of optimal superloss processes. In many reasonable cases, this 
class contains an optimal element, a function which provides us with the intrinsic 
measure of complexity of a sequence of events with respect to no particular 
learning strategy. 

The concept of predictive complexity is a natural development of the theory 
of prediction with expert advice (see P H E]) and it was introduced in the 
paper Q. In the theory of prediction with expert advice we merge some given 
learning strategies. Roughly speaking, predictive complexity may be regarded as 
a mixture of all possible strategies and some “superstrategies” . 

The paper m introduces a method that allows to prove the existence of 
predictive complexity for many natural games. This method relies upon the 
Aggregating Algorithm (see 0) and works for all so-called mixable games. It 
is still an open problem whether the mixability is a necessary condition for the 
existence of predictive complexity but, in this paper, we restrict ourselves to 
mixable games. 

In Sect. 2 we give the precise definition of the environment our learning
algorithms work in. A particular kind of environment (a particular game) is 
specified by choosing an outcome space, a hypothesis space, and a loss function. 
A loss function measures the loss suffered by a prediction algorithm in this 
environment and thus it is of interest to compare games with the same outcome 
space and the same hypothesis space but different loss functions. In this paper 
we compare the values of predictive complexity of strings w.r.t. different games 
and therefore we compare the inherent learnability of an object in different 
environments. 

Our goal is to describe the set of pairs (a, b) such that the inequality
$a\mathcal{K}^1(x) + b|x| \ge^+ \mathcal{K}^2(x)$ holds for complexities $\mathcal{K}^1$ and $\mathcal{K}^2$ specified by
mixable games $\mathfrak{G}_1$ and $\mathfrak{G}_2$. In Sect. 3 we formulate necessary and sufficient
conditions for the inequalities $a\mathcal{K}^1(x) + b|x| \ge^+ \mathcal{K}^2(x)$ and
$a_1\mathcal{K}^1(x) + a_2\mathcal{K}^2(x) \le^+ b|x|$ to hold. We
establish both geometrical and probabilistic criteria. It is remarkable that both
inequalities hold if and only if their counterparts hold “on the average”.

In Sect. 4 we apply our results to relations between the square-loss and
logarithmic complexity we have not investigated before.

2 Definitions 

The notations in this paper are generally the same as in the paper El but some 
extra notations will also be introduced. 

We will denote the binary alphabet {0,1} by B and finite binary strings by 
bold lowercase letters, e.g. x,y. The expression \x\ denotes the length of x and 
B* denotes the set of all finite binary strings. 

We use the following notations, typical for works on Kolmogorov complexity.
We will write $f(x) \le^+ g(x)$ for real-valued functions f and g if there is a constant
$C > 0$ such that $f(x) \le g(x) + C$ for all x from the domain of these functions
(the set B* throughout this paper). We consider mostly logarithms to the base
2 and we denote $\log_2$ by log.

We begin with the definition of a game. A game $\mathfrak{G}$ is a triple $(\Omega, \Gamma, \lambda)$, where
$\Omega$ is called an outcome space, $\Gamma$ stands for a hypothesis space, and
$\lambda : \Omega \times \Gamma \to \mathbb{R} \cup \{+\infty\}$ is a loss function. We suppose that a definition of
computability over $\Omega$ and $\Gamma$ is given and $\lambda$ is computable according to this definition.

Admitting the possibility of $\lambda(\omega, \gamma) = +\infty$ is essential (cf. 0). We need this
assumption to take the very interesting logarithmic game into consideration.
The continuity of a function $f : M \to \mathbb{R} \cup \{+\infty\}$ at a point $x_0 \in M$ such that
$f(x_0) = +\infty$ is the property $\lim_{x \to x_0, x \in M} f(x) = +\infty$ (the continuity in the
extended topology).

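Throughout this paper, we let $\Omega = B = \{0,1\}$ and $\Gamma = [0,1]$. We will
consider the following examples of games: the square-loss game with

$$\lambda(\omega, \gamma) = (\omega - \gamma)^2 \tag{1}$$

and the logarithmic game with

$$\lambda(\omega, \gamma) = \begin{cases} -\log(1-\gamma) & \text{if } \omega = 0 \\ -\log\gamma & \text{if } \omega = 1. \end{cases} \tag{2}$$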
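
The two loss functions (1) and (2) translate directly into code. This is a minimal sketch with hypothetical function names; logarithms are to the base 2, as in the text, and the value $+\infty$ is represented by `math.inf`.

```python
import math


def square_loss(omega, gamma):
    """Square loss (1): squared distance between outcome and prediction."""
    return (omega - gamma) ** 2


def log_loss(omega, gamma):
    """Logarithmic loss (2), base 2; infinite when the outcome was given weight 0."""
    p = gamma if omega == 1 else 1.0 - gamma
    return math.inf if p == 0 else -math.log2(p)
```

Predicting 1/2 costs 1/4 under the square loss and exactly 1 under the logarithmic loss, whatever the outcome.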



A prediction algorithm $\mathfrak{A}$ works according to the following protocol:

FOR t = 1, 2, ...

(1) $\mathfrak{A}$ chooses a hypothesis $\gamma_t \in \Gamma$

(2) $\mathfrak{A}$ observes the actual outcome $\omega_t \in \Omega$

(3) $\mathfrak{A}$ suffers loss $\lambda(\omega_t, \gamma_t)$

END FOR.



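Over the first T trials, $\mathfrak{A}$ suffers the total loss

$$\mathrm{Loss}_{\mathfrak{A}}(\omega_1, \omega_2, \ldots, \omega_T) = \sum_{t=1}^{T} \lambda(\omega_t, \gamma_t). \tag{3}$$

By definition, put $\mathrm{Loss}_{\mathfrak{A}}(\Lambda) = 0$, where $\Lambda$ denotes the empty string. A function
$L : \Omega^* \to \mathbb{R} \cup \{+\infty\}$ is called a loss process w.r.t. $\mathfrak{G}$ if it coincides with the loss
$\mathrm{Loss}_{\mathfrak{A}}$ of some algorithm $\mathfrak{A}$. Note that any loss process is computable.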
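
The protocol and the total loss (3) can be sketched as a short loop. The code below is illustrative only: a strategy is modelled as a function from the list of past outcomes to a prediction, and `laplace` (Laplace's rule of succession) is a hypothetical example strategy that does not appear in the text.

```python
def square_loss(omega, gamma):
    """Square loss (1)."""
    return (omega - gamma) ** 2


def total_loss(strategy, outcomes, loss):
    """Total loss (3) of a strategy over a finite sequence of outcomes."""
    past, total = [], 0.0
    for omega in outcomes:
        gamma = strategy(past)        # step (1): predict from the past only
        total += loss(omega, gamma)   # steps (2)-(3): observe, then suffer loss
        past.append(omega)
    return total


def laplace(past):
    """Hypothetical strategy: (number of ones + 1) / (number of trials + 2)."""
    return (sum(past) + 1) / (len(past) + 2)
```

On the empty sequence the loop body never runs, so the total loss is 0, matching the convention $\mathrm{Loss}_{\mathfrak{A}}(\Lambda) = 0$.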

We say that a pair $(s_0, s_1) \in [-\infty, +\infty]^2$ is a superprediction if there exists
a hypothesis $\gamma \in \Gamma$ such that $s_0 \ge \lambda(0, \gamma)$ and $s_1 \ge \lambda(1, \gamma)$. If we consider
the set $P = \{(p_0, p_1) \in [-\infty, +\infty]^2 \mid \exists \gamma \in \Gamma : p_0 = \lambda(0, \gamma) \text{ and } p_1 = \lambda(1, \gamma)\}$
(cf. the canonical form of a game in 0), the set S of all superpredictions is
the set of points that lie “north-east” of P. We will loosely call P the set of
predictions. The set of predictions $P = \{(\gamma^2, (1-\gamma)^2) \mid \gamma \in [0,1]\}$ and the set
of superpredictions S for the square-loss game are shown in Fig. 1.
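
For the square-loss game, membership in the set S of superpredictions reduces to checking whether an interval of admissible hypotheses is non-empty. The following sketch (with a hypothetical function name) uses the conditions $\gamma^2 \le s_0$ and $(1-\gamma)^2 \le s_1$ for some $\gamma \in [0,1]$.

```python
import math


def is_square_loss_superprediction(s0, s1):
    """True if some gamma in [0, 1] has gamma^2 <= s0 and (1 - gamma)^2 <= s1."""
    if s0 < 0 or s1 < 0:
        return False
    upper = min(1.0, math.sqrt(s0))        # from gamma^2 <= s0
    lower = max(0.0, 1.0 - math.sqrt(s1))  # from (1 - gamma)^2 <= s1
    return lower <= upper
```

The point (1/4, 1/4) lies on the boundary (take $\gamma = 1/2$), whereas (0.1, 0.1) lies strictly "south-west" of the curve of predictions and is not a superprediction.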

Fig. 1. The sets of predictions and superpredictions for the square-loss game.

A function $L : \Omega^* \to \mathbb{R} \cup \{+\infty\}$ is called a superloss process w.r.t. $\mathfrak{G}$ if the
following conditions hold:

- $L(\Lambda) = 0$,

- for any $x \in \Omega^*$, the pair $(L(x0) - L(x), L(x1) - L(x))$ is a superprediction
w.r.t. $\mathfrak{G}$, and

- L is semicomputable from above.

We will say that a superloss process K is universal if it is minimal to within
an additive constant in the class of all superloss processes. In other words, a
superloss process K is universal if for any other superloss process L there exists
a constant C such that

$$\forall x \in \Omega^* : K(x) \le L(x) + C. \tag{4}$$

The difference between two universal superloss processes w.r.t. $\mathfrak{G}$ is bounded by
a constant. If universal superloss processes w.r.t. $\mathfrak{G}$ exist we may pick one and denote it
by $\mathcal{K}^{\mathfrak{G}}$. It follows from the definition that, for any L which is a superloss process
w.r.t. $\mathfrak{G}$ and any prediction algorithm $\mathfrak{A}$, we have

$$\mathcal{K}^{\mathfrak{G}}(x) \le^+ L(x), \tag{5}$$

$$\mathcal{K}^{\mathfrak{G}}(x) \le^+ \mathrm{Loss}^{\mathfrak{G}}_{\mathfrak{A}}(x), \tag{6}$$

where $\mathrm{Loss}^{\mathfrak{G}}$ denotes the loss w.r.t. $\mathfrak{G}$. One may call $\mathcal{K}^{\mathfrak{G}}$ the complexity w.r.t. $\mathfrak{G}$.

Note that universal processes are defined for concrete games only. Two games
$\mathfrak{G}_1 = (\Omega, \Gamma, \lambda_1)$ and $\mathfrak{G}_2 = (\Omega, \Gamma, \lambda_2)$ with the same outcome and hypothesis
spaces but different loss functions may have different sets of universal superloss
processes (e.g. $\mathfrak{G}_1$ may have universal processes while $\mathfrak{G}_2$ has none).

We now proceed to the definition of a mixable game. For any $A \subseteq [-\infty, +\infty]^2$
and any $(u, v) \in \mathbb{R}^2$ the shift $A + (u, v)$ is the set $\{(x + u, y + v) \mid (x, y) \in A\}$.
For the sequel, we also need the definition of the expansion $aA = \{(ax, ay) \mid
(x, y) \in A\}$, where $a \in \mathbb{R}$. For any $B \subseteq [-\infty, +\infty]^2$, the A-closure of B is the
set

$$\mathrm{cl}_A(B) = \bigcap_{(u,v) \in \mathbb{R}^2 : B \subseteq A + (u,v)} \bigl(A + (u, v)\bigr). \tag{7}$$

We let

$$A_0 = \{(x, y) \in [-\infty, +\infty]^2 \mid x \ge 0 \text{ or } y \ge 0\}, \tag{8}$$

$$A_\beta = \{(x, y) \in [-\infty, +\infty]^2 \mid \beta^x + \beta^y \le 1\}. \tag{9}$$

In Fig. 2 you can see the sets $A_0$ and $A_\beta$ for $\beta = 1/3$.

Fig. 2. The sets $A_0$ and $A_{1/3}$.



Definition 1 (KE!1). A game $\mathfrak{G}$ is mixable if there exists $\beta \in (0, 1)$ such that
for any straight line $l \subseteq \mathbb{R}^2$ passing through the origin the set
$(\mathrm{cl}_{A_\beta} S \setminus \mathrm{cl}_{A_0} S) \cap l$
contains no more than one element, where S is the set of superpredictions for
$\mathfrak{G}$.

Proposition 2 (^). For any mixable game, there exists a universal superloss
process.

Proposition 3 (I3E]). The logarithmic and the square-loss games are mixable
and therefore the complexities $\mathcal{K}^{\log}$ and $\mathcal{K}^{\mathrm{sq}}$ exist.



3 General Linear Inequalities 

In this section we prove some general results on linear inequalities. Throughout
this section $\mathfrak{G}_1$ and $\mathfrak{G}_2$ are any games with loss functions $\lambda_1$ and $\lambda_2$, sets of
predictions $P_1$ and $P_2$, and sets of superpredictions $S_1$ and $S_2$, respectively. The
closure and the boundary of $M \subseteq \mathbb{R}^2$ in the standard topology of $\mathbb{R}^2$ are denoted
by $\overline{M}$ and $\partial M$, respectively.

3.1 Case $a\mathcal{K}^1(x) + b|x| \ge^+ \mathcal{K}^2(x)$.

The following theorem is the main result of the paper. 

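Theorem 4. Suppose that the games $\mathfrak{G}_1$ and $\mathfrak{G}_2$ are mixable and specify the
complexities $\mathcal{K}^1$ and $\mathcal{K}^2$; suppose that the loss function $\lambda_1(\omega, \gamma)$ is continuous in
the second argument; then the following statements are equivalent:

(i) $\exists C > 0 \; \forall x \in B^* : \mathcal{K}^1(x) + C \ge \mathcal{K}^2(x)$,

(ii) $P_1 \subseteq S_2$,

(iii) $\forall p \in [0,1] \; \exists C > 0 \; \forall n \in \mathbb{N} : \mathbf{E}\mathcal{K}^1(\xi_1^{(p)} \ldots \xi_n^{(p)}) + C \ge \mathbf{E}\mathcal{K}^2(\xi_1^{(p)} \ldots \xi_n^{(p)})$,

where $\xi_1^{(p)}, \ldots, \xi_n^{(p)}$ are results of n independent Bernoulli trials with the
probability of 1 being equal to p.

Loosely speaking, the inequality $\mathcal{K}^1(x) \ge^+ \mathcal{K}^2(x)$ holds if and only if the
graph

$$\{(\lambda_1(0, \gamma), \lambda_1(1, \gamma)) \mid \gamma \in [0, 1]\} \tag{10}$$

lies “north-east” of the graph

$$\{(\lambda_2(0, \gamma), \lambda_2(1, \gamma)) \mid \gamma \in [0, 1]\}. \tag{11}$$

Proof. The implication (i) $\Rightarrow$ (iii) is trivial.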

Let us prove that (ii) (i). Suppose that Pi C S 2 and therefore Si C 82 . 
Let L be a superloss process w.r.t. 0i. It follows from the definition, that, for 
any x G B*, we have (P(a;0) — L(x), L(xl) — L(x)) G C 82. One can easily 
check that L(x) -I- (1 — ^f^) is a superloss process w.r.t. 02 . If we take L — Kf 
and apply u we will obtain (i). 

It remains to prove that (iii) $\Rightarrow$ (ii). Let us assume that condition (ii) is
violated, i.e. there exists $\gamma^{(0)} \in [0, 1]$ such that

$$(\lambda_1(0, \gamma^{(0)}), \lambda_1(1, \gamma^{(0)})) = (u_0, v_0) \notin S_2. \tag{12}$$

Since $\lambda_1$ is continuous, without loss of generality we may assume that $\gamma^{(0)}$ is a
computable number. We will now find $p_0 \in [0, 1]$ such that

= n(n) . (13) 

We need the following lemmas. 

Lemma 5. Let $\mathfrak{G}$ be a mixable game; then the set of superpredictions for $\mathfrak{G}$ is
convex.

Proof (of the lemma).

Let S be the set of superpredictions for $\mathfrak{G}$ and let $\beta \in (0, 1)$ be the number
from Definition 1. Consider two points $D, E \in S$ and the line segment $[D, E] \subseteq \mathbb{R}^2$.
We will prove that $[D, E] \subseteq S$.

If one of the points lies “north-east” of the other, i.e.

$$\{D, E\} = \{(x_1, y_1), (x_2, y_2)\}, \tag{14}$$

where $x_1 \ge x_2$ and $y_1 \ge y_2$, then there is nothing to prove. Now suppose that
$D = (x_1, y_1)$, $E = (x_2, y_2)$, $x_1 < x_2$, and $y_1 > y_2$ (see Fig. 3). There exists a
shift $A'$ of the set $A_\beta$ such that $D, E \in \partial A'$. Let us prove that the closed set
M bounded by the line segment [D, E] and the segment of $\partial A'$ lying between D
and E is a subset of $\mathrm{cl}_{A_\beta} S$.

Fig. 3. The set $Q \cap M$ from the proof of Lemma 5 is coloured grey.

Consider a shift $A''$ of the set $A_\beta$ such that $D, E \in A''$. Trivially, two different
shifts of $\partial A_\beta$ can have no more than one point in common. It follows from
$D, E \in A''$ that $\partial A''$ intersects the rays $r_1 = \{(x_1, y) \mid y \le y_1\}$ and $r_2 =
\{(x_2, y) \mid y \le y_2\}$. The continuity of $\partial A_\beta$ yields $M \subseteq A''$.

If there exists a point $F = (x_0, y_0) \in [D, E]$ such that $F \notin S$, then the
whole quadrant $Q = \{(x, y) \mid x \le x_0 \text{ and } y \le y_0\}$ (see Fig. 3) has no common
points with S and the set $Q \cap M \subseteq (\mathrm{cl}_{A_\beta} S \setminus \mathrm{cl}_{A_0} S)$ violates the condition of
Definition 1.

□

Lemma 6. Let $M \subseteq \mathbb{R}^2$ be a convex set closed in the standard topology of $\mathbb{R}^2$
and $(u_0, v_0) \notin M$. Suppose that for any $u, v \ge 0$ we have $M + (u, v) \subseteq M$. Then
there exist $p_0 \in [0, 1]$ and $m_2 \in \mathbb{R}$ such that, for any $(u, v) \in M$, we have

$$p_0 v + (1 - p_0) u \ge m_2 > m_1 = p_0 v_0 + (1 - p_0) u_0. \tag{15}$$

Proof (of the lemma).

The lemma can be derived from the Separation Theorem for convex sets
(see e.g. 0) but we will give a self-contained proof. Let us denote $(u_0, v_0)$ by
D. It follows from M being closed that there exists a point $E \in M$ which is
closest to D. Clearly, all the points of M lie on one side of the straight line l
which is perpendicular to DE and passes through E, and D lies on the other
side (see Fig. 4). The straight line l should come from the “north-west” to the
“south-east” and therefore, normalising its equation, one may reduce it to the
form $p_0 v + (1 - p_0) u = m_2$, where $p_0 \in [0, 1]$.

Fig. 4. The set M from the proof of Lemma 6 is coloured grey.



Lemma 7. Suppose that a game $\mathfrak{G}$ is mixable and the set S of superpredictions
for $\mathfrak{G}$ lies “north-east” of the straight line $pv + (1 - p)u = m$, i.e.

$$\forall (u, v) \in S : pv + (1 - p)u \ge m, \tag{16}$$

where $p \in [0, 1]$. If $\mathcal{K}$ is the complexity w.r.t. $\mathfrak{G}$, then

$$\mathbf{E}\mathcal{K}(\xi_1^{(p)} \ldots \xi_n^{(p)}) \ge mn. \tag{17}$$

Proof (of the lemma). Consider a superloss process L and a string x. The point
$(L(x0) - L(x), L(x1) - L(x)) = (s_1, s_2)$ is a superprediction. We have

$$\mathbf{E}(L(x\xi^{(p)}) - L(x)) = p s_2 + (1 - p) s_1, \tag{18}$$

where $\xi^{(p)}$ is a result of one Bernoulli trial with the probability of 1 being equal
to p. By (16) the right-hand side of (18) is at least m, and summing over n
trials yields (17).

□

Lemma 8. Let $\mathfrak{G}$ be a mixable game with the loss function $\lambda$ and the complexity
$\mathcal{K}$. Suppose that $\gamma^{(0)} \in [0, 1]$ is a computable number such that

$$p_0 \lambda(1, \gamma^{(0)}) + (1 - p_0) \lambda(0, \gamma^{(0)}) = m,$$

where $p_0 \in [0, 1]$; then there exists $C > 0$ such that

$$\mathbf{E}\mathcal{K}(\xi_1^{(p_0)} \ldots \xi_n^{(p_0)}) \le mn + C. \tag{19}$$

Proof (of the lemma). The proof is by considering the strategy which makes the
prediction $\gamma^{(0)}$ on each trial and applying (6). □

The theorem follows. □

For any game $\mathfrak{G}$ with the loss function $\lambda$, any positive real a, and any real b,
one may consider the game $\mathfrak{G}_{a,b}$ with the loss function $\lambda_{a,b} = a\lambda + b$. Any L(x)
is a superloss process w.r.t. $\mathfrak{G}$ if and only if $aL(x) + b|x|$ is a superloss process
w.r.t. $\mathfrak{G}_{a,b}$. This implies the following corollary.

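Corollary 9. Under the conditions of Theorem 4 the following statements are
equivalent:

(i) $\exists C > 0 \; \forall x \in B^* : a\mathcal{K}^1(x) + b|x| + C \ge \mathcal{K}^2(x)$,

(ii) $aP_1 + (b, b) \subseteq S_2$,

(iii) $\forall p \in [0,1] \; \exists C > 0 \; \forall n \in \mathbb{N} : a\mathbf{E}\mathcal{K}^1(\xi_1^{(p)} \ldots \xi_n^{(p)}) + bn + C \ge \mathbf{E}\mathcal{K}^2(\xi_1^{(p)} \ldots \xi_n^{(p)})$,

where $\xi_1^{(p)}, \ldots, \xi_n^{(p)}$ are results of n independent Bernoulli trials with the
probability of 1 being equal to p.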

It is natural to ask whether the extra term b|x| can be replaced by a smaller
term. The next corollary follows from the proof of Theorem 4 and clarifies the
situation.

Corollary 10. Suppose that under the conditions of Theorem 4 the following
statement holds:

For any $p \in [0, 1]$ there exists a function $\alpha_p : \mathbb{N} \to \mathbb{R}$ such that $\alpha_p(n) =
o(n)$ $(n \to +\infty)$ and for any $n \in \mathbb{N}$ the inequality

$$a\mathbf{E}\mathcal{K}^1(\xi_1^{(p)} \ldots \xi_n^{(p)}) + bn + \alpha_p(n) \ge \mathbf{E}\mathcal{K}^2(\xi_1^{(p)} \ldots \xi_n^{(p)}) \tag{20}$$

holds, where $\xi_1^{(p)}, \ldots, \xi_n^{(p)}$ are results of n independent Bernoulli trials with the
probability of 1 being equal to p.

Then the inequality

$$a\mathcal{K}^1(x) + b|x| \ge^+ \mathcal{K}^2(x)$$

holds.

Proof. The corollary follows from (13). □

Corollary 11. If under the conditions of Theorem 4 there exists a function
$f : \mathbb{N} \to \mathbb{R}$ such that $f(n) = o(n)$ $(n \to +\infty)$ and, for any $x \in B^*$, the
inequality

$$a\mathcal{K}^1(x) + b|x| + f(|x|) \ge \mathcal{K}^2(x) \tag{21}$$

holds, then the inequality

$$a\mathcal{K}^1(x) + b|x| \ge^+ \mathcal{K}^2(x)$$

holds.






The next statement shows a property of the set of all pairs (a, b) such that
$a \ge 0$ and the inequality $a\mathcal{K}^1(x) + b|x| \ge^+ \mathcal{K}^2(x)$ holds.

Corollary 12. Under the conditions of Theorem 4 the set

$$\{(a, b) \mid a \ge 0 \text{ and } \exists C > 0 \; \forall x \in B^* : a\mathcal{K}^1(x) + b|x| + C \ge \mathcal{K}^2(x)\} \tag{22}$$

is closed in the topology of $\mathbb{R}^2$.

Proof. The proof is by continuity of $\lambda_1$. □

3.2 Case $a_1\mathcal{K}^1(x) + a_2\mathcal{K}^2(x) \le^+ b|x|$.

In the previous subsection we considered nonnegative values of a. In this
subsection we study the inequality $a\mathcal{K}^1(x) + b|x| \ge^+ \mathcal{K}^2(x)$ with negative a or, in
other words, the inequality $a_1\mathcal{K}^1(x) + a_2\mathcal{K}^2(x) \le^+ b|x|$ with $a_1, a_2 > 0$.

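Theorem 13. Suppose that games $\mathfrak{G}_1$ and $\mathfrak{G}_2$ are mixable and, for any $\gamma \in
[0, 1]$, we have

$$\lambda_1(0, \gamma) = \lambda_1(1, 1 - \gamma), \tag{23}$$

$$\lambda_2(0, \gamma) = \lambda_2(1, 1 - \gamma), \tag{24}$$

where $\lambda_1$ and $\lambda_2$ are the loss functions. Then, for any $a_1, a_2 > 0$, the following
statements are equivalent:

(i) $\exists C > 0 \; \forall x \in B^* : a_1\mathcal{K}^1(x) + a_2\mathcal{K}^2(x) \le b|x| + C$,

(ii) $a_1\lambda_1(0, 1/2) + a_2\lambda_2(0, 1/2) \le b$,

(iii) $\exists C > 0 \; \forall n \in \mathbb{N} : a_1\mathbf{E}\mathcal{K}^1(\xi_1^{(1/2)} \ldots \xi_n^{(1/2)}) + a_2\mathbf{E}\mathcal{K}^2(\xi_1^{(1/2)} \ldots \xi_n^{(1/2)}) \le bn + C$,

where $\xi_1^{(1/2)}, \ldots, \xi_n^{(1/2)}$ are results of n independent Bernoulli trials with the
probability of 1 being equal to 1/2.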

Proof. The proof is similar to the one of Theorem 4 but a little simpler.

Lemma 14. Suppose that a game $\mathfrak{G}$ is mixable, $\lambda$ is its loss function, and $\mathcal{K}$
is the complexity w.r.t. $\mathfrak{G}$. If, for any $\gamma \in [0, 1]$, we have $\lambda(0, \gamma) = \lambda(1, 1 - \gamma)$,
then, for any $x \in B^*$, we have

$$\mathcal{K}(x) \le^+ \lambda(0, 1/2)|x|. \tag{25}$$

Proof (of the lemma). The proof is by considering the strategy which makes the
prediction 1/2 on each trial and applying (6). □
Clearly, the sets of superpredictions $S_1$ and $S_2$ for $\mathfrak{G}_1$ and $\mathfrak{G}_2$ lie
“north-east” of the straight lines $x/2 + y/2 = \lambda_1(0, 1/2)$ and $x/2 + y/2 = \lambda_2(0, 1/2)$,
respectively. It follows from Lemma 7 that, for any $n \in \mathbb{N}$, the inequalities

$$\mathbf{E}\mathcal{K}^1(\xi_1^{(1/2)} \ldots \xi_n^{(1/2)}) \ge \lambda_1(0, 1/2)n, \tag{26}$$

$$\mathbf{E}\mathcal{K}^2(\xi_1^{(1/2)} \ldots \xi_n^{(1/2)}) \ge \lambda_2(0, 1/2)n \tag{27}$$

hold. The theorem follows.

□

4 Application to the Square-Loss and Logarithmic Complexity

In this section we will apply our general results to the square-loss and logarithmic
games.

Theorem 15. If $a \ge 0$, then the inequality

$$a\mathcal{K}^{\log}(x) + b|x| \ge^+ \mathcal{K}^{\mathrm{sq}}(x) \tag{28}$$

holds if and only if $b \ge \max(\frac{1}{4} - a, 0)$.

Proof. We apply Corollary 0 

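Let $p \in [0, 1]$. To estimate the expectations, we need the values

$$E_p^{\mathrm{sq}} := \min_{0 \le \gamma \le 1} \bigl(p(1 - \gamma)^2 + (1 - p)\gamma^2\bigr) \tag{29}$$

$$= p(1 - p) \tag{30}$$

and

$$E_p^{\log} := \min_{0 \le \gamma \le 1} \bigl(-p \log\gamma - (1 - p)\log(1 - \gamma)\bigr) \tag{31}$$

$$= -p \log p - (1 - p)\log(1 - p). \tag{32}$$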
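
The closed forms (30) and (32) can be checked numerically by minimising the expected one-step losses over a grid of hypotheses. The sketch below is only a sanity check with hypothetical function names; the grid excludes the endpoints 0 and 1 so that the logarithms stay finite.

```python
import math


def expected_square(p, gamma):
    """Expected square loss of predicting gamma when 1 occurs with probability p."""
    return p * (1 - gamma) ** 2 + (1 - p) * gamma ** 2


def expected_log(p, gamma):
    """Expected logarithmic loss (base 2), for 0 < gamma < 1."""
    return -p * math.log2(gamma) - (1 - p) * math.log2(1 - gamma)


def grid_min(fn, p, steps=1000):
    """Minimum of fn(p, gamma) over gamma = 1/steps, ..., (steps - 1)/steps."""
    return min(fn(p, k / steps) for k in range(1, steps))


def entropy(p):
    """Binary entropy in bits, with the convention 0 log 0 = 0."""
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
```

In both games the minimum is attained at $\gamma = p$, which is why the minimal expected losses are $p(1 - p)$ and the binary entropy of p.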



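It follows from Lemmas 7 and 8 that there are $C_1, C_2 > 0$ such that, for any
$p \in [0, 1]$ and for any $n \in \mathbb{N}$, we have

$$\bigl|\mathbf{E}\mathcal{K}^{\log}(\xi_1^{(p)} \ldots \xi_n^{(p)}) - E_p^{\log} n\bigr| \le C_1, \tag{33}$$

$$\bigl|\mathbf{E}\mathcal{K}^{\mathrm{sq}}(\xi_1^{(p)} \ldots \xi_n^{(p)}) - E_p^{\mathrm{sq}} n\bigr| \le C_2. \tag{34}$$

Therefore the inequality $a\mathcal{K}^{\log}(x) + b|x| \ge^+ \mathcal{K}^{\mathrm{sq}}(x)$ holds if and only if for any
$p \in [0, 1]$ the inequality $aE_p^{\log} + b \ge E_p^{\mathrm{sq}}$ holds.

Lemma 16. For any $a \ge 0$, we have

$$\sup_{p \in [0,1]} \bigl(E_p^{\mathrm{sq}} - aE_p^{\log}\bigr) = \max\bigl(\tfrac{1}{4} - a, 0\bigr). \tag{35}$$

The theorem follows. □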
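
The supremum in Lemma 16 can be approximated on a grid as a numerical sanity check. The sketch below uses $E_p^{\mathrm{sq}} = p(1 - p)$ and $E_p^{\log}$ equal to the binary entropy of p; the function names are hypothetical and the grid is chosen to contain p = 0, 1/2 and 1, where the supremum is attained.

```python
import math


def entropy(p):
    """Binary entropy in bits, with the convention 0 log 0 = 0."""
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def sup_gap(a, steps=2000):
    """Grid approximation of sup over p of p(1 - p) - a * H(p)."""
    return max(
        p * (1 - p) - a * entropy(p)
        for p in (k / steps for k in range(steps + 1))
    )
```

For a below 1/4 the maximum sits at p = 1/2 and equals 1/4 - a; for larger a the difference is never positive and the supremum 0 is reached at p = 0 and p = 1.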



The next theorem corresponds to Subsect. 3.2.

Theorem 17. For any $a_1, a_2 > 0$ and any b the inequality

$$a_1\mathcal{K}^{\mathrm{sq}}(x) + a_2\mathcal{K}^{\log}(x) \le^+ b|x|$$

holds for any $x \in B^*$ if and only if $a_1/4 + a_2 \le b$.

Proof. The proof is by applying Theorem 13. □

Abstract. We design efficient on-line algorithms that predict nearly as
well as the best pruning of a planar decision graph. We assume that
the graph has no cycles. As in the previous work on decision trees, we
implicitly maintain one weight for each of the prunings (exponentially
many). The method works for a large class of algorithms that update their
weights multiplicatively. It can also be used to design algorithms that
predict nearly as well as the best convex combination of prunings.



1 Introduction 

Decision trees are widely used in Machine Learning. Frequently a large tree
is produced initially and then this tree is pruned for the purpose of obtaining
a better predictor. A pruning is produced by deleting some nodes and with
them all their successors. Although there are exponentially many prunings, a
recent method developed in coding theory [WST95] and machine learning [Bun92]
makes it possible to (implicitly) maintain one weight per pruning. In particular
Helmbold and Schapire [HS97] use this method to design an elegant algorithm
that is guaranteed to predict nearly as well as the best pruning of a decision
tree. Pereira and Singer modify this algorithm to the case of edge-based
prunings instead of the node-based prunings defined above. Edge-based prunings
are produced by cutting some edges of the original decision tree and then
removing all nodes below the cuts. Both definitions are closely related. Edge-based
prunings have been applied to statistical language modeling where the
out-degree of nodes in the tree may be very large.

In this paper we generalize the methods from decision trees to planar directed 
acyclic graphs (dags). Trees, upside-down trees and series-parallel dags are all 
special cases of planar dags. We define a notion of edge-based prunings of a 
planar dag. Again we find a way to efficiently maintain one weight for each of 
the exponentially many prunings. 

Fig. 1. An example of a decision tree, a pruning, and a decision dag

In Fig. 1 the tree T' represents a node-based pruning of the decision tree
T. Each node in the original tree T is assumed to have a prediction value in
some prediction space Y. Here, we assume Y = [0, 1]. In the usual setting each
instance (from some instance space) induces a path in the decision tree from the
root to a leaf. The path is based on decisions done at the internal nodes. Thus 
w.l.o.g. our instances are paths in the original decision tree. For a given path, a 
tree predicts the value at the leaf at which the path ends. For example, for the 
path {a, c, /, f}, the original tree T predicts the value 0.2 and the pruning T' 
predicts 0.6. 

In what follows, we consider the prediction values to be associated with the
edges. In Fig. 1 the prediction value of each edge is given at its lower endpoint. For
example, the edges a, b and c have prediction values 0.4, 0.6 and 0.3, respectively.
Moreover, we think of a pruning as the set of edges that are incident to the
leaves of the pruning. So, T and T' are represented by {d, e, g, h, i} and {b, f, g},
respectively. Note that for any pruning R and any path P, R intersects P at
exactly one edge. That is, a pruning “cuts” each path at an edge. The pruning
R predicts on path P with the prediction value of the edge that is cut.
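
The edge-based view of a pruning can be illustrated with a few lines of code. In this sketch the values of edges a, b and c and the predictions 0.6 and 0.2 follow the running example; the path (a, c, f, i) and the assignment of 0.2 to edge i are assumptions made for illustration, since the figure itself is not reproduced here, and edges whose values are not stated in the text are omitted.

```python
# Edge prediction values: a, b, c are given in the text; f = 0.6 and
# i = 0.2 are assumed so that the pruning T' predicts 0.6 and T predicts 0.2.
EDGE_VALUE = {"a": 0.4, "b": 0.6, "c": 0.3, "f": 0.6, "i": 0.2}

T_FULL = {"d", "e", "g", "h", "i"}   # the original tree T as a set of edges
T_PRUNED = {"b", "f", "g"}           # the pruning T'


def predict(pruning, path, edge_value):
    """Prediction of a pruning on a path: the value of the unique cut edge."""
    cut = [edge for edge in path if edge in pruning]
    if len(cut) != 1:
        raise ValueError("a pruning must cut each path at exactly one edge")
    return edge_value[cut[0]]


PATH = ["a", "c", "f", "i"]  # hypothetical root-to-leaf path
```

On this path T' cuts at edge f and T cuts at the leaf edge i, so the two predictions are the values attached to those edges.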

The notion of pruning can easily be generalized to directed acyclic graphs. 
We define decision dags as dags with a special source and sink node where each 
edge is assumed to have a prediction value. A pruning R of the decision dag 
is defined as a set of edges such that for any s-t path P, R intersects P with 
exactly one edge. Again the pruning R predicts on the instance/path P with 
the value of the edge that is cut. It is easily seen that the rightmost graph G in
Fig. 1 is a decision dag that is equivalent to T.

We study learning in the on-line prediction model where the decision dag is
given to the learner. At each trial t = 1, 2, . . ., the learner receives a path $P_t$ and
must produce a prediction $\hat{y}_t \in Y$. Then an outcome $y_t$ in the outcome space
Y is observed (which can be thought of as the correct value of $P_t$). Finally, at
the end of the trial the learner suffers loss $L(y_t, \hat{y}_t)$, where $L : Y \times Y \to [0, \infty]$
is a fixed loss function. Since each pruning R has a prediction value for $P_t$, the
loss of R at this trial is defined analogously. The goal of the learner is to make
predictions so that its total loss $\sum_t L(y_t, \hat{y}_t)$ is not much worse than the total
loss of the best pruning or that of the best mixture (convex combination) of
prunings.

It is straightforward to apply any of a large family of on-line prediction
algorithms to our problem. To this end, we just consider each pruning R of G
as an expert that predicts as the pruning R does on all paths. Then, we can
make use of any on-line algorithm that maintains one weight per expert and
forms its own prediction $\hat{y}_t$ by combining the predictions of the experts using
the current weights (see for example [Lit88, LW94, Vov90, Vov95, CBFH+97,
KW97]). Many relative loss bounds have been proven for this setting, bounding
the additional total loss of the algorithm over the total loss of the best expert
or best weighted combination of experts. However, this “direct” implementation
of the on-line algorithms is inefficient because one weight/expert would be used
for each pruning of the decision dag and the number of prunings is usually
exponentially large. So the goal is to find efficient implementations of the direct
algorithms so that the exponentially many weights are implicitly maintained.
In the case of trees this is possible [HS97]. Other applications that simulate
algorithms with exponentially many weights are given in [HPW99, MW98]. We
now sketch how this can be done when the decision dag is planar.

Recall that each pruning and path intersect at one edge. Therefore in each 
trial the edges on the path determine the predictions of all the prunings as well as 
their losses. So in each trial the edges on the path also incur a loss and the total 
loss of a pruning is always the sum of the total losses of all of its edges. Under a 
very general setting the weight of a pruning is then a function of the weights of 
its edges. Thus the exponentially many weights of the prunings collapse to one 
weight per edge. It is obvious that if we can efficiently update the edge weights 
and compute the prediction of the direct algorithm from the edge weights, then 
we have an efficient algorithm that behaves exactly as the direct algorithm. 

One of the most important families of on-line learning algorithms is the one
that does multiplicative updates of its weights. For this family the weight $w_R$
of a pruning $R$ is always the product of the weights of its edges, i.e.
$w_R = \prod_{e \in R} v_e$. The most important computation in determining the prediction and
updating the weights is summing the current weights of all the prunings, i.e.
$\sum_R \prod_{e \in R} v_e$. We do not know how to efficiently compute this sum for arbitrary
decision dags. However, for planar decision dags, computing this sum is reduced
to computing another sum for the dual planar dag. The prunings in the primal
graph correspond to paths in the dual, and paths in the primal to prunings in
the dual. Therefore the above sum is equivalent to $\sum_P \prod_{e \in P} v_e$, where $P$ ranges
over all paths in the dual dag. Curiously enough, the same formula appears as the
likelihood of a sequence of symbols in a Hidden Markov Model where the edge
weights are the transition probabilities. So we can use the well known
forward-backward algorithm for computing the above formula efficiently [LRS83].

The overall time per trial is linear in the number of edges of the decision dag.
For the case where the dag is series-parallel, we can improve the time per trial
to grow linearly in the size of the instance (a path in the dag).

Another approach for solving the on-line pruning problem is to use the
specialist framework developed by Freund, Schapire, Singer and Warmuth [FSSW97].
Now each edge is considered to be a specialist. In trial t only the edges on the 
path “awake” and all others are “asleep”. The predictions of the awake edges 
are combined to form the prediction of the algorithm. The redeeming feature 
of their algorithm is that it works for arbitrary sets of prunings and paths over 
some set of edges with the property that any pruning and any path intersect at 
exactly one edge. They can show that their algorithm performs nearly as well as 
any mixture of specialists, that is, essentially as well as the best single pruning. 

However, even in the case of decision trees the loss bound of their algorithm 
is quadratic in the size of the pruning. In contrast, the loss bound for the direct 
algorithm grows only linearly in the size of the pruning. Also, when we use, for
example, the EG algorithm [KW97] as our direct algorithm, then the direct
algorithm (as well as its efficient simulation) predicts nearly as well as the best 
convex combination of prunings. 



2 On-Line Pruning of a Decision Dag

A decision dag is a directed acyclic graph $G = (V, E)$ with a designated start
node $s$ and a terminal node $t$. We call $s$ and $t$ the source and the sink of $G$,
respectively. An s-t path is a set of edges of $G$ that forms a path from the
source to the sink. In the decision dag $G$, each edge $e \in E$ is assumed to have a
predictor that, when given an instance (s-t path) that includes the edge $e$, makes
a prediction from the prediction space $\hat{Y}$. In a typical setting, the predictions
would be real numbers from $\hat{Y} = [0, 1]$. Although the predictor at edge $e$ may
make different predictions whenever the path passes through $e$, we write its
prediction as $\xi(e) \in \hat{Y}$.

A pruning $R$ of $G$ is a set of edges such that for any s-t path $P$, $R$
intersects $P$ in exactly one edge, i.e., $|R \cap P| = 1$. Let $e_{R \cap P}$ denote the edge at
which $R$ and $P$ intersect. Because of the intersection property, a pruning $R$ can
be thought of as a well-defined function from any instance $P$ to a prediction
$\xi(e_{R \cap P}) \in \hat{Y}$. Let $\mathcal{P}(G)$ and $\mathcal{R}(G)$ denote the set of all paths and all prunings of
$G$, respectively. For example, the decision dag $G$ in Fig. 1 has four prunings, i.e.,
$\mathcal{R}(G) = \{\{a\}, \{b, c\}, \{b, f, g\}, \{d, e, g\}\}$. Assume that we are given an instance
$P = \{a, b, e\}$. Then, the pruning $R = \{b, f, g\}$ predicts 0.6 for this instance $P$,
which is the prediction of the predictor at edge $b = e_{R \cap P}$.
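The definition of a pruning can be checked by brute force on a small graph. The sketch below is only an illustration: the five-edge dag `EDGES` and the helper names `st_paths` and `prunings` are hypothetical and are not the dag of the paper's figure. It enumerates all s-t paths and then keeps every edge set that meets each path in exactly one edge.

```python
from itertools import combinations

# Hypothetical toy dag (not the one in the paper's figure): edge name -> (tail, head).
EDGES = {"a": ("s", "u"), "b": ("s", "v"), "c": ("u", "t"),
         "d": ("v", "t"), "e": ("u", "v")}

def st_paths(edges, source="s", sink="t"):
    """Return every source-sink path as a frozenset of edge names."""
    paths = []

    def walk(node, used):
        if node == sink:
            paths.append(frozenset(used))
            return
        for name, (tail, head) in edges.items():
            if tail == node:
                walk(head, used + [name])

    walk(source, [])
    return paths

def prunings(edges):
    """Return every edge set that meets each s-t path in exactly one edge."""
    paths = st_paths(edges)
    names = sorted(edges)
    found = []
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            chosen = set(subset)
            if all(len(chosen & p) == 1 for p in paths):
                found.append(frozenset(chosen))
    return found
```

The enumeration is exponential in the number of edges, which is exactly why the rest of the paper avoids handling prunings explicitly.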

We study learning in the on-line prediction model, where an algorithm is
required not to actually produce prunings but to make predictions for a given
instance sequence based on a given decision dag $G$. The goal is to make
predictions that are competitive with those made by the best pruning of $G$ or with
those by the best mixture of prunings of $G$. We will now state our learning model
more precisely. A prediction algorithm $A$ is given a decision dag $G$ as its input.
At each trial $t = 1, 2, \ldots$, algorithm $A$ receives an instance/path $P_t \in \mathcal{P}(G)$ and
generates a prediction $\hat{y}_t \in \hat{Y}$. After that, an outcome $y_t \in Y$ is observed. $Y$ is a
set called the outcome space. Typically, the outcome space $Y$ would be the same
as $\hat{Y}$. At this trial, the algorithm $A$ suffers loss $L(y_t, \hat{y}_t)$, where $L : Y \times \hat{Y} \to [0, \infty]$
is a fixed loss function. For example the square loss is $L(y, \hat{y}) = (y - \hat{y})^2$ and the
relative-entropic loss is given by $L(y, \hat{y}) = y \ln(y/\hat{y}) + (1 - y) \ln((1 - y)/(1 - \hat{y}))$.
For any instance-outcome sequence $S = ((P_1, y_1), \ldots, (P_T, y_T)) \in (\mathcal{P}(G) \times Y)^*$,
the cumulative loss of $A$ is defined as $L_A(S) = \sum_{t=1}^{T} L(y_t, \hat{y}_t)$. In what follows,
the cumulative loss of $A$ is simply called the loss of $A$. Similarly, for a pruning
$R$ of $G$, the loss of $R$ for $S$ is defined as

\[ L_R(S) = \sum_{t=1}^{T} L(y_t, \xi(e_{R \cap P_t})). \]

The performance of $A$ is measured in two ways. The first one is to compare the
loss of $A$ to the loss of the best $R$. In other words, the goal of algorithm $A$ is to
make predictions so that its loss $L_A(S)$ is close to $\min_{R \in \mathcal{R}(G)} L_R(S)$. The other
goal (that is harder to achieve) is to compare the loss of $A$ to the loss of the
best mixture of prunings. To be more precise, we introduce a mixture vector $u$
indexed by $R$ so that $u_R \ge 0$ for $R \in \mathcal{R}(G)$ and $\sum_R u_R = 1$. Then the goal of $A$
is to achieve a loss $L_A(S)$ that is close to $\min_u L_u(S)$, where

\[ L_u(S) = \sum_{t=1}^{T} L\Bigl(y_t, \sum_{R \in \mathcal{R}(G)} u_R\, \xi(e_{R \cap P_t})\Bigr). \]

Note that the former goal can be seen as the special case of the latter one where
the mixture vector $u$ is restricted to unit vectors (i.e., $u_R = 1$ for some particular
$R$).

3 Dual Problem for a Planar Decision Dag 

In this section, we show that our problem of on-line pruning has an equivalent
dual problem provided that the underlying graph $G$ is planar. The duality will
be used to make our algorithms efficient. An s-t cut of $G$ is a minimal set of
edges of $G$ such that its removal from $G$ results in a graph where $s$ and $t$ are
disconnected. First we point out that a pruning of $G$ is an s-t cut of $G$ as
well. The converse is not necessarily true. For instance, the set $\{a, e, f\}$ is an
s-t cut of $G$ in Fig. 1 but it is not a pruning because a path $\{a, d, e\}$ intersects
the cut in more than one edge. So, the set of prunings $\mathcal{R}(G)$ is a subset of
all s-t cuts of $G$, and our problem can be seen as an on-line min-cut problem
where cuts are restricted to $\mathcal{R}(G)$. To see this, let us consider the cumulative
loss $\ell_e = \sum_{t : e \in P_t} L(y_t, \xi(e))$ at edge $e$ as the capacity of $e$. Then, the loss of a
pruning $R$, $L_R(S) = \sum_{e \in R} \ell_e$, can be interpreted as the
total capacity of the cut $R$. This implies that a pruning of minimum loss is a
minimum capacity cut from $\mathcal{R}(G)$.

It is known in the literature that the (unrestricted) min-cut problem for an
s-t planar graph can be reduced to the shortest path problem for its dual graph
(see, e.g., [Hu69, Law70, Has81]). A slight modification of the reduction gives
us a dual problem for the best pruning (restricted min-cut) problem. Below we
show how to construct the dual dag from a planar decision dag $G$ that is
suitable for our purpose.

Assume we have a planar decision dag $G = (V, E)$ with source $s$ and sink
$t$. Since the graph $G$ is acyclic, we have a planar representation of $G$ so that
the vertices in $V$ are placed on a vertical line with all edges downward. In this
linear representation, the source $s$ and the sink $t$ are placed at the top and the
bottom of the line, respectively (see Fig. 2). The vertical line (the dotted line)
bisects the plane and defines two outer faces $s'$ and $t'$ of $G$. Let $s'$ be the right
face. The dual dag $G^D = (V^D, E^D)$ is constructed as follows. The set of vertices
$V^D$ consists of all faces of $G$. Let $e \in E$ be an edge of $G$ which is common to the
boundaries of two faces $f_r$ and $f_l$ in $G$. By virtue of the linear representation,
we can let $f_r$ be the “right” face on $e$ and $f_l$ be the “left” face on $e$. Then, let
$E^D$ include the edge $e' = (f_r, f_l)$ directed from $f_r$ to $f_l$. It is clear that the dual
dag $G^D$ is a planar directed acyclic graph with source $s'$ and sink $t'$, and the
dual of $G^D$ is $G$. The following proposition is crucial in this paper.

Proposition 1. Let $G$ be a planar decision dag and $G^D$ be its dual dag. Then,
there is a one-to-one correspondence between s-t paths $\mathcal{P}(G)$ in $G$ and prunings
$\mathcal{R}(G^D)$ in $G^D$, and there is also a one-to-one correspondence between prunings
$\mathcal{R}(G)$ in $G$ and s'-t' paths $\mathcal{P}(G^D)$ in $G^D$.

Thus there is a natural dual problem associated with the on-line pruning
problem. We now describe this dual on-line shortest path problem. An
algorithm $A$ is given as input a decision dag $G$. At each trial $t = 1, 2, \ldots$,
algorithm $A$ receives a pruning $R_t \in \mathcal{R}(G)$ as the instance and generates a
prediction $\hat{y}_t \in \hat{Y}$. The loss of $A$, denoted $L_A(S)$, for an instance-outcome sequence
$S = ((R_1, y_1), \ldots, (R_T, y_T)) \in (\mathcal{R}(G) \times Y)^*$ is defined as $L_A(S) = \sum_{t=1}^{T} L(y_t, \hat{y}_t)$.
The class of predictors to which the performance of $A$ is now compared consists
of all paths. For a path $P$ of $G$, the loss of $P$ for $S$ is defined as

\[ L_P(S) = \sum_{t=1}^{T} L(y_t, \xi(e_{R_t \cap P})). \]

Similarly, for a mixture vector $u$ indexed by $P$ so that $u_P \ge 0$ for $P \in \mathcal{P}(G)$
and $\sum_P u_P = 1$, the loss of the $u$-mixture of paths is defined as

\[ L_u(S) = \sum_{t=1}^{T} L\Bigl(y_t, \sum_{P \in \mathcal{P}(G)} u_P\, \xi(e_{R_t \cap P})\Bigr). \]

The objective of $A$ is to make the loss as small as the loss of the best path $P$,
i.e., $\min_P L_P(S)$, or the best mixture of paths, i.e., $\min_u L_u(S)$. It is natural to
call this the on-line shortest path problem because if we consider the cumulative
loss $\ell_e = \sum_{t : e \in R_t} L(y_t, \xi(e))$ at edge $e$ as the length of $e$, then the loss of $P$,
$L_P(S) = \sum_{e \in P} \ell_e$, can be interpreted as the total length of $P$. It is clear from
the duality that the on-line pruning problem for a decision dag $G$ is equivalent
to the on-line shortest path problem for its dual dag $G^D$. In what follows, we
consider only the on-line shortest path problem.
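To make the "shortest path" reading concrete, the following sketch accumulates the per-edge cumulative losses and then finds the loss of the best path in hindsight with a standard dag shortest-path pass. It is a minimal illustration under several assumptions: the toy dag `EDGES`, the topological order `TOPO`, the square loss, and edge predictions that stay fixed over time (the dictionary `xi`) are all hypothetical choices, not part of the paper.

```python
# Hypothetical toy dag: edge name -> (tail, head); TOPO lists nodes in topological order.
EDGES = {"a": ("s", "u"), "b": ("s", "v"), "c": ("u", "t"),
         "d": ("v", "t"), "e": ("u", "v")}
TOPO = ["s", "u", "v", "t"]

def edge_lengths(edges, trials, xi):
    """Cumulative square loss per edge; only edges in the trial's pruning are charged."""
    length = {name: 0.0 for name in edges}
    for pruning, outcome in trials:
        for name in pruning:
            length[name] += (outcome - xi[name]) ** 2
    return length

def best_path_loss(edges, topo, length):
    """Smallest total length of a source-to-sink path, i.e. the loss of the best path."""
    dist = {node: float("inf") for node in topo}
    dist[topo[0]] = 0.0
    for node in topo:  # nodes are relaxed in topological order
        for name, (tail, head) in edges.items():
            if tail == node:
                dist[head] = min(dist[head], dist[node] + length[name])
    return dist[topo[-1]]
```

This computes only the comparator (the best path after the fact); the on-line algorithm itself must predict before the lengths are known.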



4 Inefficient Direct Algorithm 

In this section, we show the direct implementation of the algorithms for the
on-line shortest path problem. Namely, the algorithm considers each path $P$ of $G$
as an expert that makes a prediction $x_{t,P} = \xi(e_{R_t \cap P})$ for a given pruning $R_t$.
Note that this direct implementation would be inefficient because the number of
experts (the number of paths in this case) can be exponentially large.

In general, such direct algorithms have the following generic form: They
maintain a weight $w_{t,P} \in [0, 1]$ for each path $P \in \mathcal{P}(G)$; when given the predictions
$x_{t,P}$ $(= \xi(e_{R_t \cap P}))$ of all paths $P$, they combine these predictions based on the
weights to make their own prediction $\hat{y}_t$, and then update the weights after the
outcome $y_t$ is observed. In what follows, let $w_t$ and $x_t$ denote the weight and
prediction vectors indexed by $P \in \mathcal{P}(G)$, respectively. Let $N$ be the number
of experts, i.e., the cardinality of $\mathcal{P}(G)$. More precisely, the generic algorithm
consists of two parts:

- a prediction function $\mathrm{pred} : \bigcup_N [0, 1]^N \times \hat{Y}^N \to \hat{Y}$ which maps the current
  weight and prediction vectors $(w_t, x_t)$ of experts to a prediction $\hat{y}_t$, and

- an update function $\mathrm{update} : \bigcup_N [0, 1]^N \times \hat{Y}^N \times Y \to [0, 1]^N$ which maps
  $(w_t, x_t)$ and outcome $y_t$ to a new weight vector $w_{t+1}$.

Using these two functions, the generic on-line algorithm behaves as follows: For
each trial $t = 1, 2, \ldots$,

1. Observe predictions $x_t$ from the experts.

2. Predict $\hat{y}_t = \mathrm{pred}(w_t, x_t)$.

3. Observe outcome $y_t$ and suffer loss $L(y_t, \hat{y}_t)$.

4. Calculate the new weight vector according to $w_{t+1} = \mathrm{update}(w_t, x_t, y_t)$.

Vovk’s Aggregating Algorithm (AA) [Vov90] is a seminal on-line algorithm of
generic form and has the best possible loss bound for a very wide class of loss
functions. It updates its weights as $w_{t+1} = \mathrm{update}(w_t, x_t, y_t)$, where

\[ w_{t+1,P} = w_{t,P} \exp(-L(y_t, x_{t,P})/c_L) \]

for any $P \in \mathcal{P}(G)$. Here $c_L$ is a constant that depends on the loss function $L$.
Since AA uses a complicated prediction function, we only discuss a simplified
algorithm called the Weighted Average Algorithm (WAA) [KW99]. The latter
algorithm uses the same updates with a slightly worse constant $c_L$ and predicts
with the weighted average based on the normalized weights:

\[ \hat{y}_t = \mathrm{pred}(w_t, x_t) = \sum_{P \in \mathcal{P}(G)} \bar{w}_{t,P}\, x_{t,P}, \quad \text{where } \bar{w}_{t,P} = w_{t,P} \Big/ \sum_{P'} w_{t,P'}. \]

The following theorem gives an upper bound on the loss of the WAA in terms 
of the loss of the best path. 

Theorem 1 ([KW99]). Assume $Y = \hat{Y} = [0, 1]$. Let the loss function $L$ be
monotone convex and twice differentiable with respect to the second argument.
Then, for any instance-outcome sequence $S \in (\mathcal{R}(G) \times Y)^*$,

\[ L_{\mathrm{WAA}}(S) \le \min_{P \in \mathcal{P}(G)} \bigl\{ L_P(S) + c_L \ln(1/\bar{w}_{1,P}) \bigr\}, \]

where $\bar{w}_{1,P}$ is the normalized initial weight of $P$.
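The direct (inefficient) WAA can be written down in a few lines once the paths are listed explicitly. The sketch below is an illustration only: it assumes the square loss, fixed edge predictions `xi`, uniform initial weights, and a constant `c` passed in by the caller (the value 2.0 used as the default is an assumption for this example, since the appropriate constant depends on the loss function). The function names are hypothetical.

```python
import math

def square_loss(y, y_hat):
    return (y - y_hat) ** 2

def waa_predict(weights, preds):
    """Weighted average of the expert predictions under normalized weights."""
    total = sum(weights.values())
    return sum(weights[p] * preds[p] for p in weights) / total

def waa_update(weights, preds, outcome, c):
    """Multiply each path weight by exp(-L(y, x_P) / c)."""
    return {p: w * math.exp(-square_loss(outcome, preds[p]) / c)
            for p, w in weights.items()}

def run_direct_waa(paths, xi, trials, c=2.0):
    """Run the direct algorithm with one explicit weight per path."""
    weights = {p: 1.0 / len(paths) for p in paths}  # uniform initial weights
    total_loss = 0.0
    for pruning, outcome in trials:
        # Expert P predicts with the unique edge it shares with the pruning.
        preds = {p: xi[next(iter(p & pruning))] for p in paths}
        y_hat = waa_predict(weights, preds)
        total_loss += square_loss(outcome, y_hat)
        weights = waa_update(weights, preds, outcome, c)
    return total_loss, weights
```

Here `paths` is a list of frozensets of edge names and each trial is a pair (pruning as a frozenset, outcome). The dictionary of weights has one entry per path, which is the exponential cost that the efficient implementation removes.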

We can obtain a more powerful bound in terms of the loss of the best mixture
of paths using the exponentiated gradient (EG) algorithm due to Kivinen and
Warmuth [KW97]. The EG algorithm uses the same prediction function pred as
the WAA and uses the update function $w_{t+1} = \mathrm{update}(w_t, x_t, y_t)$ so that for
any $P \in \mathcal{P}(G)$,

\[ w_{t+1,P} = w_{t,P} \exp\Bigl(-\eta\, x_{t,P} \left.\frac{\partial L(y_t, z)}{\partial z}\right|_{z = \hat{y}_t}\Bigr). \]

Here $\eta$ is a positive learning rate. Kivinen and Warmuth show the following loss
bound of the EG algorithm for the square loss function $L(y, \hat{y}) = (y - \hat{y})^2$. Note
that, for the square loss, the update above becomes
$w_{t+1,P} = w_{t,P} \exp(-2\eta(\hat{y}_t - y_t) x_{t,P})$.
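The square-loss special case follows from one differentiation. As a short worked step, assuming the general EG update multiplies $w_{t,P}$ by $\exp(-\eta\, x_{t,P}\, \partial L(y_t, z)/\partial z)$ evaluated at $z = \hat{y}_t$:

```latex
\[
\frac{\partial L(y_t, z)}{\partial z}
  = \frac{\partial}{\partial z}\,(y_t - z)^2
  = -2\,(y_t - z),
\qquad\text{so}\qquad
\left.\frac{\partial L(y_t, z)}{\partial z}\right|_{z = \hat{y}_t}
  = 2\,(\hat{y}_t - y_t),
\]
\[
w_{t+1,P} = w_{t,P}\,\exp\bigl(-2\eta\,(\hat{y}_t - y_t)\,x_{t,P}\bigr).
\]
```

When the algorithm's prediction was too high ($\hat{y}_t > y_t$), paths with large predictions $x_{t,P}$ are demoted the most; when it was too low, they are promoted.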

Theorem 2 ([KW97]). Assume $Y = \hat{Y} = [0, 1]$. Let $L$ be the square loss
function. Then, for any instance-outcome sequence $S \in (\mathcal{R}(G) \times Y)^*$ and for
any probability vector $u \in [0, 1]^N$ indexed by $P$,

\[ L_{\mathrm{EG}}(S) \le \frac{2}{2 - \eta}\, L_u(S) + \frac{1}{\eta}\, \mathrm{RE}(u \,\|\, \bar{w}_1), \]

where $\mathrm{RE}(u \,\|\, \bar{w}_1) = \sum_P u_P \ln(u_P / \bar{w}_{1,P})$ is the relative entropy between $u$ and
the initial normalized weight vector $\bar{w}_1$.

Now we give the two conditions on the direct algorithms that are required
for our efficient implementation.

Definition 1. Let $w \in [0, 1]^N$ and $x \in \hat{Y}^N$ be a weight and a prediction vector.
Let $\mathcal{P}_1 \cup \cdots \cup \mathcal{P}_k = \mathcal{P}(G)$ be a partition of $\mathcal{P}(G)$ such that $\mathcal{P}_i \cap \mathcal{P}_j = \emptyset$ for any
$i \ne j$ and all paths in the same class have the same prediction. That is, for each
class $\mathcal{P}_i$, there exists $x'_i \in \hat{Y}$ such that $x_P = x'_i$ for any $P \in \mathcal{P}_i$. In other words,
$x' = (x'_1, \ldots, x'_k)$ and $w' = (w'_1, \ldots, w'_k)$, where $w'_i = \sum_{P \in \mathcal{P}_i} w_P$, can be seen
as a projection of the original prediction vector $x$ and weight vector $w$ onto the
partition $\{\mathcal{P}_1, \ldots, \mathcal{P}_k\}$. The prediction function pred is projection-preserving if
$\mathrm{pred}(w, x) = \mathrm{pred}(w', x')$ for any $w$ and $x$.

Definition 2. The update function update is multiplicative if there exists a
function $f$ with domain $\hat{Y} \times Y \times \hat{Y}$ such that for any $w \in [0, 1]^N$, $x \in \hat{Y}^N$ and $y \in Y$, the
new weight $w' = \mathrm{update}(w, x, y)$ is given by $w'_P = w_P f(x_P, y, \hat{y})$ for any $P$,
where $\hat{y} = \mathrm{pred}(w, x)$.

These conditions are natural. In fact, they are satisfied by the prediction
and update functions used in many families of algorithms such as AA [Vov90],
WAA [KW99], EG and EGU [KW97]. Note that the projection used may change
from trial to trial.
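As a quick sanity check of Definition 1, the weighted-average prediction is projection-preserving: merging experts that predict the same value, and summing their weights, leaves the prediction unchanged. The sketch below illustrates this with hypothetical helper names; it groups experts by equal prediction value, which is one admissible partition among possibly many.

```python
def weighted_average(weights, preds):
    """pred(w, x) for the weighted-average rule."""
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, preds)) / total

def project(weights, preds):
    """Merge experts with equal predictions, summing their weights."""
    merged = {}
    for w, x in zip(weights, preds):
        merged[x] = merged.get(x, 0.0) + w
    values = sorted(merged)
    return [merged[x] for x in values], values
```

For example, weights (0.2, 0.3, 0.5) with predictions (0, 0, 1) project to weights (0.5, 0.5) with predictions (0, 1), and both give the same weighted average.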



5 Efficient Implementation of the Direct Algorithm 



Now we give an efficient implementation of a direct algorithm that consists of
a projection-preserving prediction function pred and a multiplicative update
function update. Clearly, it is sufficient to show how to efficiently compute the
functions pred and update. Obviously, we cannot explicitly maintain all of the
weights $w_{t,P}$ as the direct algorithm does since there may be exponentially many
paths $P$ in $G$. Instead, we maintain a weight $v_{t,e}$ for each edge $e$, which requires
only linear space. We will give indirect algorithms below for computing pred
and update so that the weights $v_{t,e}$ for edges implicitly represent the weights
$w_{t,P}$ for all paths $P$ as follows:

\[ w_{t,P} = \prod_{e \in P} v_{t,e}. \tag{1} \]



First we show an indirect algorithm for update, which is simpler. Suppose
that, given a pruning $R_t$ as instance, we have already calculated a prediction
$\hat{y}_t$ and observed an outcome $y_t$. Then, the following update for the weights of the
edges is equivalent to the update $w_{t+1} = \mathrm{update}(w_t, x_t, y_t)$ of the weights of
the paths. Recall that $w_t$ is a weight vector indexed by $P$ given by (1) and $x_t$ is
a prediction vector given by $x_{t,P} = \xi(e_{R_t \cap P})$. Let $f$ be the function associated
with our multiplicative update (see Definition 2). For any edge $e \in E$, the weight
of $e$ is updated according to

\[ v_{t+1,e} = \begin{cases} v_{t,e}\, f(\xi(e), y_t, \hat{y}_t) & \text{if } e \in R_t, \\ v_{t,e} & \text{otherwise.} \end{cases} \tag{2} \]

Lemma 1. The update rule for edges given by (2) is equivalent to
$\mathrm{update}(w_t, x_t, y_t)$.

Proof. It suffices to show that the relation (1) is preserved after updating. That
is, $w_{t+1,P} = \prod_{e \in P} v_{t+1,e}$ for any path $P$. Let $e' = e_{R_t \cap P}$. Since update is
multiplicative, we have

\[ w_{t+1,P} = w_{t,P}\, f(x_{t,P}, y_t, \hat{y}_t) = \Bigl(\prod_{e \in P} v_{t,e}\Bigr) f(\xi(e'), y_t, \hat{y}_t) = \Bigl(\prod_{e \in P \setminus \{e'\}} v_{t,e}\Bigr) \bigl(v_{t,e'}\, f(\xi(e'), y_t, \hat{y}_t)\bigr) = \prod_{e \in P} v_{t+1,e}, \]

as required. □
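Lemma 1 can also be checked numerically on a small example. The sketch below is an illustration under assumptions: the three paths in `PATHS` come from a hypothetical toy dag, and the factor function `f` passed by the caller is arbitrary. It applies the edge update to the edge weights and the multiplicative update to the explicit path weights, so that the two results can be compared.

```python
import math

# Paths of a hypothetical toy dag, each given as a set of edge names.
PATHS = [frozenset("ac"), frozenset("aed"), frozenset("bd")]

def path_weight(path, v):
    """Path weight as the product of its edge weights."""
    return math.prod(v[e] for e in path)

def update_edges(v, pruning, xi, y, y_hat, f):
    """Only the edges in the current pruning are rescaled."""
    return {e: (w * f(xi[e], y, y_hat) if e in pruning else w)
            for e, w in v.items()}

def update_paths(w, pruning, xi, y, y_hat, f):
    """Direct multiplicative update on explicit path weights."""
    new = {}
    for path, weight in w.items():
        (edge,) = path & pruning  # the unique shared edge
        new[path] = weight * f(xi[edge], y, y_hat)
    return new
```

Because each path contains exactly one edge of the pruning, exactly one factor of each path product changes, which is the whole content of the proof.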

Next we show an indirect algorithm for pred. Let the given pruning be $R_t =
\{e_1, \ldots, e_k\}$. For $1 \le i \le k$, let $\mathcal{P}_i = \{P \in \mathcal{P}(G) \mid e_i \in P\}$. Since $|R_t \cap P| = 1$
for any $P$, $\mathcal{P}_1 \cup \cdots \cup \mathcal{P}_k = \mathcal{P}(G)$ forms a partition of $\mathcal{P}(G)$ and clearly for any
path $P \in \mathcal{P}_i$, we have $x_{t,P} = \xi(e_i)$. So,

\[ x' = (\xi(e_1), \ldots, \xi(e_k)) \tag{3} \]

is a projected prediction vector of $x_t$. Therefore, if we have the corresponding
projected weight vector $w'$, then by the projection-preserving property of pred
we can obtain $\hat{y}_t$ by $\mathrm{pred}(w', x')$, which equals $\hat{y}_t = \mathrm{pred}(w, x)$. Now what we
have to do is to efficiently compute the projected weights for $1 \le i \le k$:

\[ w'_i = \sum_{P \in \mathcal{P}_i} w_{t,P} = \sum_{P : e_i \in P} w_{t,P} = \sum_{P : e_i \in P} \prod_{e \in P} v_{t,e}. \tag{4} \]

Surprisingly, the $\sum\prod$-form formula above is similar to the formula of the
likelihood of a sequence of symbols in a Hidden Markov Model (HMM) with a
particular state transition $(e_i)$ [LRS83]. Thus we can compute (4) with the
forward-backward algorithm. For node $u \in V$, let $\mathcal{P}_{s \to u}$ and $\mathcal{P}_{u \to t}$ be the set of paths
from $s$ to the node $u$ and the set of paths from the node $u$ to $t$, respectively.
Define

\[ \alpha(u) = \sum_{P \in \mathcal{P}_{s \to u}} \prod_{e \in P} v_{t,e} \quad \text{and} \quad \beta(u) = \sum_{P \in \mathcal{P}_{u \to t}} \prod_{e \in P} v_{t,e}. \]

Suppose that $e_i = (u_1, u_2)$. Then, the set of all paths in $\mathcal{P}(G)$ through $e_i$ is
represented as $\{P_1 \cup \{e_i\} \cup P_2 \mid P_1 \in \mathcal{P}_{s \to u_1}, P_2 \in \mathcal{P}_{u_2 \to t}\}$, and therefore the
formula (4) is given by

\[ w'_i = \alpha(u_1)\, v_{t,e_i}\, \beta(u_2). \tag{5} \]

We summarize this result as the following lemma.

Lemma 2. Let $x'$ and $w'$ be given by (3) and (4), respectively. Then
$\mathrm{pred}(w', x') = \mathrm{pred}(w, x)$.

The forward-backward algorithm [LRS83] is an algorithm that efficiently
computes $\alpha$ and $\beta$ by dynamic programming as follows: $\alpha(u) = 1$ if $u = s$ and
$\alpha(u) = \sum_{u' \in V : (u', u) \in E} \alpha(u')\, v_{t,(u',u)}$ otherwise. Similarly, $\beta(u) = 1$ if $u = t$ and
$\beta(u) = \sum_{u' \in V : (u, u') \in E} \beta(u')\, v_{t,(u,u')}$ otherwise. It is clear that both $\alpha$ and $\beta$ can
be computed in time $O(|E|)$.
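The two recurrences translate directly into two passes over the edges. The following sketch is a minimal illustration, assuming the dag is given as a dictionary of named edges together with a topological order of its nodes (both `EDGES` and `TOPO` are hypothetical toy data). It computes the forward and backward sums and then the projected weight of each edge in a pruning as the product of the forward sum at its tail, its own weight, and the backward sum at its head.

```python
# Hypothetical toy dag: edge name -> (tail, head); TOPO is a topological order.
EDGES = {"a": ("s", "u"), "b": ("s", "v"), "c": ("u", "t"),
         "d": ("v", "t"), "e": ("u", "v")}
TOPO = ["s", "u", "v", "t"]

def forward_backward(edges, topo, v):
    """Return (alpha, beta): path-weight sums from the source and to the sink."""
    pos = {node: i for i, node in enumerate(topo)}
    alpha = {node: 0.0 for node in topo}
    beta = {node: 0.0 for node in topo}
    alpha[topo[0]] = 1.0   # source
    beta[topo[-1]] = 1.0   # sink
    # Forward pass: visit edges by increasing position of their tail,
    # so alpha[tail] is final before it is used.
    for name in sorted(edges, key=lambda e: pos[edges[e][0]]):
        tail, head = edges[name]
        alpha[head] += alpha[tail] * v[name]
    # Backward pass: visit edges by decreasing position of their head,
    # so beta[head] is final before it is used.
    for name in sorted(edges, key=lambda e: pos[edges[e][1]], reverse=True):
        tail, head = edges[name]
        beta[tail] += beta[head] * v[name]
    return alpha, beta

def projected_weights(edges, topo, v, pruning):
    """Total weight of all paths through each edge of the pruning."""
    alpha, beta = forward_backward(edges, topo, v)
    return {e: alpha[edges[e][0]] * v[e] * beta[edges[e][1]] for e in pruning}
```

The sorting step is only a convenience of this sketch; with edges already stored in topological order, each pass touches every edge once. Since the projected weights over a pruning partition all paths, they sum to the forward sum at the sink.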

6 A More Efficient Algorithm for Series-Parallel Dags 

In the case of decision trees, there is a very efficient algorithm with per trial
time linear in the size of the instance (a path in the decision tree) [HS97]. We
now give an algorithm with the same improved time per trial for series-parallel
dags, which include decision trees.

A series-parallel dag $G(s, t)$ with source $s$ and sink $t$ is defined recursively
as follows: An edge $(s, t)$ is a series-parallel dag; if $G_1(s_1, t_1), \ldots, G_k(s_k, t_k)$ are
disjoint series-parallel dags, then the series connection $G(s, t) = \mathrm{s}(G_1, \ldots, G_k)$
of these dags, where $s = s_1$, $t_i = s_{i+1}$ for $1 \le i \le k - 1$ and $t = t_k$, or the parallel
connection $G(s, t) = \mathrm{p}(G_1, \ldots, G_k)$ of these dags, where $s = s_1 = \cdots = s_k$ and
$t = t_1 = \cdots = t_k$, is a series-parallel dag. Note that a series-parallel dag has a
parse tree, where each internal node represents a series or a parallel connection
of the dags represented by its child nodes.

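It suffices to show that the projected weights (4) can be calculated in time
linear in the size of the instance/pruning $R_t$. To do so the algorithm maintains one
weight $v_{t,G}$ per node $G$ of the parse tree so that

\[ v_{t,G} = \sum_{P \in \mathcal{P}(G)} \prod_{e \in P} v_{t,e} \]

holds. Note that if $G$ consists of a single edge $e$, then $v_{t,G} = v_{t,e}$; if $G =
\mathrm{s}(G_1, \ldots, G_k)$, then $v_{t,G} = \prod_{i=1}^{k} v_{t,G_i}$; if $G = \mathrm{p}(G_1, \ldots, G_k)$, then
$v_{t,G} = \sum_{i=1}^{k} v_{t,G_i}$. Now (4), i.e., $W(G, e) = \sum_{P \in \mathcal{P}(G),\, e \in P} \prod_{e' \in P} v_{t,e'}$, is recursively
computed as

\[ W(G, e) = \begin{cases} v_{t,e} & \text{if } G \text{ consists of } e, \\ W(G_i, e)\, v_{t,G} / v_{t,G_i} & \text{if } G = \mathrm{s}(G_1, \ldots, G_k) \text{ and } e \in G_i, \\ W(G_i, e) & \text{if } G = \mathrm{p}(G_1, \ldots, G_k) \text{ and } e \in G_i. \end{cases} \]

The weights $v_{t,G}$ are also recursively updated as

\[ v_{t+1,G} = \begin{cases} v_{t+1,e} & \text{if } G \text{ consists of } e, \\ v_{t+1,G_i}\, v_{t,G} / v_{t,G_i} & \text{if } G = \mathrm{s}(G_1, \ldots, G_k) \text{ and } R_t \text{ is in } G_i, \\ \sum_{i=1}^{k} v_{t+1,G_i} & \text{if } G = \mathrm{p}(G_1, \ldots, G_k). \end{cases} \]

It is not hard to see that the prediction and the update can be calculated in
time linear in the size of $R_t$.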
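The recursion for the node weights and for $W(G, e)$ can be illustrated on a small parse tree. The sketch below is not the linear-time algorithm: for brevity it recomputes node values on demand instead of maintaining them across trials, and it assumes all node weights are nonzero so that the division in the series case is defined. The tuple encoding of the parse tree (`"e"` for an edge, `"s"` for series, `"p"` for parallel) and the names `TREE`, `V`, `value`, `contains` and `through` are hypothetical.

```python
import math

# Hypothetical parse tree: s( p(a, b), p(c, s(d, e)) ).
TREE = ("s", [("p", [("e", "a"), ("e", "b")]),
              ("p", [("e", "c"), ("s", [("e", "d"), ("e", "e")])])])
V = {"a": 0.5, "b": 0.25, "c": 1.0, "d": 0.5, "e": 2.0}

def value(node, v):
    """Node weight: sum over the paths of the sub-dag of the product of edge weights."""
    kind, body = node
    if kind == "e":
        return v[body]
    parts = [value(child, v) for child in body]
    return math.prod(parts) if kind == "s" else sum(parts)

def contains(node, edge):
    """True if the sub-dag represented by node contains the given edge."""
    kind, body = node
    if kind == "e":
        return body == edge
    return any(contains(child, edge) for child in body)

def through(node, edge, v):
    """W(G, e): total weight of the paths of the sub-dag that use the edge."""
    kind, body = node
    if kind == "e":
        return v[body]
    child = next(c for c in body if contains(c, edge))
    if kind == "s":
        # Every path crosses all series components; swap in the restricted child.
        return through(child, edge, v) * value(node, v) / value(child, v)
    # Parallel case: only the paths inside the child can use the edge.
    return through(child, edge, v)
```

In this example the paths are {a,c}, {a,d,e}, {b,c} and {b,d,e}, so the weight through edge d is 0.5·0.5·2 + 0.25·0.5·2 = 0.75, which is what the recursion returns.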

Note that the dual of a series-parallel dag is also a series-parallel dag that
has the same parse tree with the series and the parallel connections exchanged.
So we can solve the primal on-line pruning problem using the same parse tree.

Abstract. We present efficient on-line algorithms for learning unions of
a constant number of tree patterns, unions of a constant number of
one-variable pattern languages, and unions of a constant number of pattern
languages with fixed length substitutions. By fixed length substitutions
we mean that each occurrence of variable $x_i$ must be substituted by
terminal strings of fixed length $l(x_i)$. We prove that if arbitrary unions
of pattern languages with fixed length substitutions can be learned
efficiently then DNFs are efficiently learnable in the mistake bound model.
Since we use a reduction to Winnow, our algorithms are robust against
attribute noise. Furthermore, they can be modified to handle concept
drift. Also, our approach is quite general and may be applicable to
learning other pattern related classes. For example, we could learn a more
general pattern language class in which a penalty (i.e. weight) is assigned
to each violation of the rule that a terminal symbol cannot be changed
or that a pair of variable symbols, of the same variable, must be
substituted by the same terminal string. An instance is positive iff the penalty
incurred for violating these rules is below a given tolerable threshold.



1 Introduction 

A pattern $p$ is a string in $(T \cup S)^*$ for sets $T$ of terminal symbols and $S$ of variable
symbols. The number of terminal symbols could be infinite. For a pattern $p$, let
$\mathcal{L}(p)$ denote the set of strings from $T^+$ that can be obtained by substituting
non-empty strings from $T^+$ for the variables in $p$. We call $\mathcal{L}(p)$ the pattern language
generated by $p$. The strings in $\mathcal{L}(p)$ are positive instances and the others are
negative instances. For example, $p = 1x_10x_21x_3x_10x_2$ is a 3-variable pattern.
The instance 11010111001101011 is in $\mathcal{L}(p)$ since it can be obtained by the
substitutions $x_1 = 101$, $x_2 = 11$, $x_3 = 001$.
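The example can be reproduced mechanically. The sketch below is a simple illustration, assuming a pattern is represented as a list of symbols in which any symbol starting with "x" is a variable (a convention chosen here, not taken from the paper). `substitute` applies a given substitution, and `matches` is a naive backtracking membership test that tries every non-empty binding for each variable; it can take exponential time, in line with the hardness of the membership problem for general patterns.

```python
def substitute(pattern, subst):
    """Replace every variable in the pattern by its assigned terminal string."""
    return "".join(subst.get(sym, sym) for sym in pattern)

def matches(pattern, string, subst=None):
    """True if some substitution of non-empty strings turns pattern into string."""
    subst = dict(subst or {})           # private copy for this search branch
    if not pattern:
        return string == ""
    head, rest = pattern[0], pattern[1:]
    if not head.startswith("x"):        # terminal symbol
        return string.startswith(head) and matches(rest, string[len(head):], subst)
    if head in subst:                   # variable that is already bound
        bound = subst[head]
        return string.startswith(bound) and matches(rest, string[len(bound):], subst)
    for end in range(1, len(string) + 1):   # try every non-empty prefix
        subst[head] = string[:end]
        if matches(rest, string[end:], subst):
            return True
    return False

# The pattern from the text: p = 1 x1 0 x2 1 x3 x1 0 x2.
PATTERN = ["1", "x1", "0", "x2", "1", "x3", "x1", "0", "x2"]
```

With the substitution x1 = 101, x2 = 11, x3 = 001, `substitute(PATTERN, ...)` yields the instance quoted above.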

Pattern languages were first introduced by Angluin. Since then, they
have been extensively investigated in the identification in the limit framework.
They have also been studied in the PAC learning and exact learning frameworks.
They are applicable to text processing, automated data entry systems, case-based
reasoning and genome informatics.

Learning general pattern languages is a very difficult problem. In fact, even if
the learner knows the target pattern, deciding whether a string can be generated
by that pattern is NP-complete. Ko and Tzeng showed that the
consistency problem of pattern languages is $\Sigma_2^P$-complete. Schapire proved
a stronger result. He showed that pattern languages cannot be learned efficiently
in the PAC-model assuming $\mathrm{P/poly} \ne \mathrm{NP/poly}$ regardless of the representation
used by the learning algorithm. In the exact model, Angluin proved that
learning with membership and equivalence queries requires exponential time.

A natural approach in making pattern languages learnable is to restrict the
number of occurrences of each variable symbol in the pattern to one or at
most some constant $k$. Another approach is to bound the number of variables
by some constant (though there is no restriction on the number of times each
variable symbol can be used). Kearns and Pitt gave a polynomial-time
PAC-learning algorithm for learning such $k$-variable patterns under the assumption
that examples are drawn from a product distribution. However, for arbitrary
distributions, the problem seems to be difficult even if $k = 2$. We present
an efficient algorithm that does not place any restrictions on $k$ or the number
of times each variable symbol occurs (albeit at the cost of only allowing fixed
length substitutions). Furthermore, we can also learn the union of a constant
number of patterns even with attribute noise.

For $k = 1$, Angluin presented a learner that produces a descriptive pattern
in O(l^ log l) update time, where l is the length of all the examples seen so far. A
pattern $p$ is said to be descriptive if given a sample $S$ that can be generated by $p$,
no other pattern that generates $S$ can generate a proper subset of the language
generated by $p$. Erlebach et al. gave a more efficient algorithm that outputs a
descriptive pattern in expected total learning time O(|p|^ log |p|), where |p| is the
length of the target pattern $p$. Recently, Reischuk and Zeugmann proved that
if the sample $S$ is drawn from some fixed distribution satisfying certain benign
restrictions and the learner is not required to output a descriptive pattern, then
one can learn one-variable patterns with expected total time linear in the length
of the pattern while converging within a constant number of rounds.

In their paper, Reischuk and Zeugmann suggested several research
directions in learning one-variable patterns. First, they pointed out that even with
two variables (i.e. $k = 2$) the situation becomes considerably more complicated
and will require additional tools. One open problem they suggested is to
construct efficient algorithms for learning unions of a constant number of one-variable
pattern languages. In a later section, we present an efficient algorithm to learn the
union of $L$ one-variable pattern languages in the mistake bound model. Our
algorithm tolerates attribute errors but requires that the learner be given one positive
example, which does not contain attribute noise, for each pattern. The number
of attribute errors of a labeled string $(s, y)$, with respect to a target pattern,
is the number of (terminal) symbols of $s$ that have to be changed so that the
classification of the resulting string by the target pattern is consistent with $y$.
The update time is polynomial in the length of the noise-free positive example
of each pattern, and the current instance that we want to classify. However, it
is exponential in $L$. When $L = 1$ our algorithm is less efficient than the algorithm
of Reischuk and Zeugmann. However, our analysis is a worst-case analysis which
does not assume the sample is drawn from a fixed distribution. It also tolerates
concept drift.

A concept class that closely resembles pattern languages is the class of tree
patterns. A tree pattern $p$ is a rooted tree where the internal nodes are labeled
using a set $T$ of terminal symbols while the leaves may be labeled using $T$ or a
set $S$ of variable symbols. An instance $t$ is a “ground” tree if all the nodes are
labeled by terminal symbols. An instance $t$ is in the language $\mathcal{L}(p)$ generated by
a tree pattern $p$ if $t$ can be obtained from $p$ by substituting the leaves labeled
with the same variable symbol by the same ground tree. Those tree patterns
where the siblings are distinguishable from each other are referred to as ordered
and otherwise as unordered. A union of ordered (resp. unordered) tree patterns
is called an ordered forest (resp. unordered forest). In this paper, we consider
only ordered trees and forests. For recent results on learning unordered forests
see Amoth, Cull and Tadepalli [3].

The study of tree patterns is motivated by natural language processing
and symbolic integration, where instances are represented as parse trees
and expressions, respectively. Tree patterns are also closely related to logic
program representations. Using the exact learning model with membership
and equivalence queries, Arimura, Ishizaka and Shinohara showed that
ordered forests with a bounded number of trees can be learned efficiently.
Subsequently, Amoth, Cull and Tadepalli [2] showed that ordered forests with an
infinite alphabet are exactly learnable using equivalence and membership queries.
They also showed that ordered trees are exactly learnable with only equivalence
queries. We give an efficient algorithm to learn unions of a constant number of
ordered tree patterns (in the mistake bound model without membership queries)
in the presence of attribute noise. The number of attribute errors of a labeled
ground tree $(t, y)$, with respect to a target pattern, is the number of (terminal)
symbols in the nodes of $t$ that have to be changed so that the classification of
the resulting tree by the target tree pattern is consistent with $y$. Our algorithm
does not require any restrictions on the alphabet size for the terminal symbols
or on the number of children per node.



2 Our Results 

In this paper, we present algorithms to learn unions of pattern languages and tree
patterns. We obtain all of our algorithms by reductions of the following flavor. We
introduce two sets of boolean attributes. One set is to ensure that the terminal
symbols have not been changed. The other set is for ensuring all variable symbols
are substituted properly. The target concept is then represented as a conjunction
of a relatively small number of these attributes. More specifically, the number of
relevant attributes depends only on the number of patterns in the target union,
the number of variables in the patterns and the number of occurrences of the
variable symbols in the patterns. We achieve this goal while keeping the total
number of attributes polynomial in the length of the examples (which could
be arbitrarily longer than the number of variable symbols). Furthermore, since
the target concept is represented as a conjunction of boolean attributes, we
can employ Winnow to obtain a small mistake bound and to handle attribute
noise. Finally, since a disjunction of a constant number of terms can be reduced
to a conjunction (with size exponential in the number of terms) we can use
our technique to learn unions of a constant number of patterns. This approach
seems to be quite general and was employed to learn geometric patterns [17]. It
is possibly applicable to learning other pattern related concept classes as well.

In Section 4 we apply our technique to learn a union of a constant number
of pattern languages with the only restriction being that there are fixed length
substitutions. A pattern language $\mathcal{L}(p)$ is said to have fixed length substitutions
if each variable $x_i$ can only be substituted by terminal strings of constant length
$l(x_i)$. The constant $l(x_i)$ depends only on $x_i$ and can be different for different
variables. Trivially, this means that all strings in $\mathcal{L}(p)$ must be of the same
length. The resulting algorithm learns a union of pattern languages $\mathcal{L}(p_1), \ldots,
\mathcal{L}(p_L)$ with fixed length substitutions using polynomial time (for $L$ constant) for
each prediction and with a worst-case mistake bound of

$$O\left(\left(\prod_{i=1}^{L}(2V_i - k_i + 1)\right)\left(\sum_{i=1}^{L}\log n_i\right) + \sum_{j=1}^{T}\prod_{i=1}^{L}\min(2A_j,\, 2V_i - k_i + 1)\right)$$

where $k_i$ is the number of variables in $p_i$, $V_i$ is the total number of occurrences
of variable symbols in $p_i$, $A_j$ is the worst-case number of attribute errors in
trial $j$, and $n_i$ is the length of the given positive example for $p_i$ (which must
have no attribute errors). Note that the mistake bound only has a logarithmic
dependence on the length of the examples. In addition, we could assign a penalty
(i.e. weight) to each violation of the rule that a terminal symbol cannot be
changed. The weights can be different for different terminal symbols. Similarly,
we can also assign a penalty to each violation of the rule that a pair of variable
symbols, of the same variable, must be substituted by the same terminal string.
If the penalty incurred by an instance for violating these rules is below a given
tolerable threshold then it is in the target concept $\mathcal{L}'(p)$ generated by $p$. If the
penalty is above the threshold then it is not in $\mathcal{L}'(p)$. Since Winnow can learn
linear threshold functions, the algorithms we present here can be extended to
this more general class of pattern languages.

Contrasting this positive result, we prove that if unions of an arbitrary num- 
ber of such patterns can be learned efficiently in the mistake bound model then 
DNFs can be learned efficiently in the mistake bound model. Whether or not 
DNF formulas can be efficiently learned is one of the more challenging open 
problems. The problem remains open even for the easier PAC learning model.

Next, in Section 5, we present an algorithm to learn $\mathcal{L}(p_1) \cup \cdots \cup \mathcal{L}(p_L)$
where each $p_i$ is a one-variable pattern. Our algorithm makes each prediction in
polynomial time (for constant $L$) and has a worst-case mistake bound of

$$O\left(\left(\prod_{i=1}^{L} 2V_i\right)\left(\sum_{i=1}^{L}\log n_i\right) + \sum_{j=1}^{T}\prod_{i=1}^{L}\min(2A_j,\, 2V_i)\right)$$

where $V_i$ is the number of occurrences of the variable symbol in $p_i$, $A_j$ is the
worst-case number of attribute errors in trial $j$, $n_i$ is the length of the first
positive example for $p_i$ (which must have no attribute errors) and $m$ is the
length of the example to be classified.

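In Section 6 we apply our technique to obtain an algorithm to learn ordered
forests composed of tree patterns $p_1, \ldots, p_L$. Our algorithm makes each prediction
in polynomial time (for constant $L$) and has worst-case mistake bound

$$O\left(\left(\prod_{i=1}^{L}|p_i|\right)\left(\sum_{i=1}^{L}\log n_i\right) + \sum_{j=1}^{T}\prod_{i=1}^{L}\min(2A_j,\, |p_i|)\right)$$
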
where $A_j$ is the worst-case number of attribute errors in trial $j$, and $n_i$ is the
length of the first positive example for $p_i$ (which must have no attribute errors).
As in the case of pattern languages with fixed length substitutions, it has been
shown that in the exact learning model with equivalence queries only, efficient
learnability of ordered forests implies efficient learnability of DNFs [2]. Thus, it
seems unlikely that unions of an arbitrary number of ordered tree patterns can be
learned in the mistake bound model.

In all of our algorithms, the requirement that the learner is initially given a
noise-free positive example for each pattern or tree pattern in the target can be
relaxed. One way is to sample the instance space for positively labeled instances.
If the attribute noise rate is low then, with high probability, we can obtain one
noise-free positive example for each (tree) pattern unless the positive instances
for a particular (tree) pattern do not occur frequently. In the latter case, we can
ignore that (tree) pattern. We can then run our algorithms for each $L$-subset of these
positive examples and use the weighted majority algorithm [30] of Littlestone
and Warmuth to “filter out” the optimum algorithm.

3 Preliminaries 

In concept learning, each instance in an instance space $X$ is labeled according to
some target concept $f$. The target concept is assumed to belong to some concept class $C$.
The model used here is the on-line (a.k.a. mistake-bound) learning model El- 
In this model, learning proceeds in a, possibly infinite, sequence of trials. In each
trial, the learner is presented with an instance $x_t$ from some domain $X$. The
learner is required to make, in polynomial time, a prediction on the classification
of $x_t$. In return, the learner receives the desired output $f(x_t)$ as feedback.
A mistake is made if the prediction does not match the desired output. The
learner’s objective is to minimize the total number of mistakes.

An important result in this model is Littlestone’s algorithm Winnow for
learning small conjunctions (or disjunctions) of boolean attributes when there is
a large number $n$ of irrelevant attributes. Winnow maintains a linear threshold
function $\sum_{i=1}^{n} w_i x_i$ where $w_i$ is a weight that is associated with the boolean
attribute $x_i$. Initially, all the weights are equal to 1. Upon receiving an input
$(v_1, \ldots, v_n)$, the algorithm predicts true if the sum $\sum_{i=1}^{n} w_i v_i$ is greater than the
fixed threshold $\theta$ and false otherwise. Typically, the threshold is set to $n$.

If the prediction is wrong then the weights are updated as follows. Suppose
the algorithm predicts false but the instance is in the target concept. Winnow
promotes the weight $w_i$, for each attribute $x_i$ in the instance that is set to 1,
by multiplying $w_i$ by some constant update factor $\alpha$ for $\alpha > 1$ (typically, we set
$\alpha = 2$). Otherwise, the algorithm must have predicted true but the instance is
not in the target concept. In this case, for each literal $x_i$ in the instance that is
set to 1, Winnow demotes the weight $w_i$ by dividing it by $\alpha$.
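
The promotion and demotion steps described above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `Winnow` and its method names are our own, and the defaults (update factor 2, threshold equal to the number of attributes) are the typical settings mentioned in the text.

```python
from typing import List, Optional, Sequence


class Winnow:
    """Illustrative sketch of Winnow over n boolean attributes."""

    def __init__(self, n: int, alpha: float = 2.0, theta: Optional[float] = None):
        self.alpha = alpha                                   # update factor, alpha > 1
        self.theta = float(n) if theta is None else theta    # threshold, typically n
        self.weights: List[float] = [1.0] * n                # all weights start at 1

    def predict(self, x: Sequence[int]) -> bool:
        # Predict true iff the weighted sum of the active attributes exceeds theta.
        return sum(w for w, v in zip(self.weights, x) if v) > self.theta

    def update(self, x: Sequence[int], label: bool) -> bool:
        """Run one trial; return True if the prediction was a mistake."""
        if self.predict(x) == label:
            return False
        # Promote on a false negative, demote on a false positive.
        factor = self.alpha if label else 1.0 / self.alpha
        for i, v in enumerate(x):
            if v:                                            # only attributes set to 1 change
                self.weights[i] *= factor
        return True
```

With four attributes, for example, three promotions are needed before the single active attribute of the instance `[1, 0, 0, 0]` carries enough weight to exceed the threshold 4.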

The number of attribute errors of a labeled example $(x_t, y_t)$, with respect to
the target disjunction, is the number of attributes of $x_t$ that have to be changed
so that the classification of the resulting example by the target is consistent with
$y_t$. In the presence of attribute noise, Littlestone offers the following performance
guarantee for Winnow.

Theorem 1. IMi Suppose, the target concept is a k-conjunction (or k-disjunc- 
tion) and makes at most A attribute errors. Then Winnow makes at most 
$O(A + k \log N)$ mistakes on any sequence of trials.

Auer and Warmuth suggested a version of Winnow which tolerates concept
drift. Here the target disjunction may drift (change slowly) in time. The
idea is that when a weight is sufficiently small, we do not demote it any further.
We restrict our discussion in this paper to the original version of Winnow but
remark that we could use the drift-tolerant version of Winnow to yield results
that tolerate shifts (details omitted).



4 Learning Unions of Pattern Languages with Fixed 
Length Substitutions 

Although general pattern languages are difficult to learn, we prove the following 
theorem which states that if the target concept is a union of L (a constant 
number of) pattern languages that have fixed length substitutions then we can 
learn it efficiently in the on-line model in the presence of attribute noise.
We note that for the case of a single pattern with fixed length substitutions 
without any attribute noise, one can use a direct application of the halving 
algorithm [M Eg to obtain an algorithm with a polynomial mistake bound. 
Besides being limited by the restrictions mentioned above, a direct use of the halving
algorithm requires exponential time to make each prediction. The algorithm
we present handles a union of a constant number of patterns, is robust against
attribute noise, and makes each prediction in polynomial time.

We first prove our result for the case where the target is a single pattern with 
no attribute noise. Then, we generalize our result to unions of patterns in the 
presence of attribute noise. 

Lemma 2. Suppose the target concept is the pattern language $\mathcal{L}(p)$ with fixed
length substitutions. Further, suppose that $p$ is composed of variables $x_1, \ldots, x_k$
with $V$ total occurrences of the variable symbols in $p$. Then the target concept
can be efficiently learned in the mistake bound model with $O((2V - k + 1)\log n)$
mistakes in the worst case. The time complexity per trial is $O(n^3)$ where $n$ is
the length of the first (positive) counterexample.

Proof. Our algorithm obtains its first positive counterexample by predicting negative
until it gets a positive counterexample $s_0$. Let $n$ be the length of $s_0$. Since
all substitutions of the same variable in the target have the same length, we
know that if an instance has length different from $n$ then it is a negative instance.
Thus, without loss of generality, we assume all the instances are exactly
of length $n$. We denote the substring of a string $s$ that begins at position $i$ and
ends at position $j$ by $s[i,j]$ and the $i$th symbol of $s$ is denoted by $s[i]$. To make a
prediction on an instance $s$, we transform $s$ to a new instance with the following
sets of boolean attributes:

- $X[i,j,l]$, $1 \le i < j \le n$, $1 \le l \le n - j + 1$. Each variable $X[i,j,l]$ is set to 1
if and only if the two substrings $s[i, i+l-1]$ and $s[j, j+l-1]$ are identical.

- $C[i,j]$, $1 \le i \le j \le n$. The variable $C[i,j]$ is set to 1 if and only if the
substrings $s[i,j]$ and $s_0[i,j]$ are the same.
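
As a rough illustration of this transformation, the following Python sketch computes both attribute sets directly from their definitions. The function name `string_attributes` and the dictionary representation are our own choices. Keys use the 1-indexed positions of the text, while Python slices are 0-indexed.

```python
from typing import Dict, Tuple


def string_attributes(s: str, s0: str) -> Tuple[Dict[tuple, int], Dict[tuple, int]]:
    """Return (X, C) for an instance s and the first positive example s0."""
    n = len(s0)
    if len(s) != n:
        raise ValueError("instances of a different length are negative outright")
    X: Dict[tuple, int] = {}
    C: Dict[tuple, int] = {}
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            for l in range(1, n - j + 2):
                # X[i,j,l] = 1 iff s[i, i+l-1] and s[j, j+l-1] are identical
                X[(i, j, l)] = int(s[i - 1:i - 1 + l] == s[j - 1:j - 1 + l])
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            # C[i,j] = 1 iff s[i,j] and s0[i,j] are the same
            C[(i, j)] = int(s[i - 1:j] == s0[i - 1:j])
    return X, C
```

This direct version spends linear time on every attribute; it is meant to mirror the definitions, not to be efficient.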

We note that our reduction is a refinement of a more direct reduction that
uses $O(n^2)$ variables (versus the $O(n^3)$ variables used above) for the case where
the length of the substitutions must always be one. The following claim shows
that by introducing the $n^3/6 + o(n^3)$ variables of the form $X[i,j,l]$, the target
concept can be represented as a conjunction where the number of relevant variables
is independent of $n$ (versus having a linear dependence on $n$). By applying
Winnow to learn this conjunction, we obtain a mistake bound with a logarithmic
dependence on $n$ versus a linear dependence on $n$.

Claim. The target $k$-variable pattern $p$ can be expressed in the transformed
instance space as a conjunction of $2V - k + 1$ attributes. Here, $V$ is the total
number of occurrences of the variable symbols in $p$.

Proof. Since the substitutions of the same variable $x$ must be of the same length
$l(x)$, the substitution of a particular variable symbol in all positive instances
must appear in the same locations. That is, for the variable symbol $x$ to appear
in a particular location in $p$, its substitution in a positive instance must appear
in position $i$ to $i + l(x) - 1$ for some fixed $i$. The substitutions for a variable $x$
that appears in two distinct positions $i$ and $j$ are the same iff $X[i,j,l(x)] = 1$.
Consider a particular variable, say $x_i$. Suppose that $x_i$ appears in a positive
instance at positions $j_1 < \ldots < j_{\alpha_i}$. Then for an instance to be positive, the
$(\alpha_i - 1)$ transformed variables $X[j_1, j_2, l(x_i)], \ldots, X[j_{\alpha_i - 1}, j_{\alpha_i}, l(x_i)]$ must all be
set to 1. Conversely, if one of these transformed variables is set to 0 then the
instance must be negative.

Further, suppose $s_0[i,j]$ is a substring of $s_0$ that corresponds to a maximal
substring in $p$ consisting of only terminal symbols. In other words, all symbols in
$s_0[i,j]$ are terminal symbols in $p$, but $s_0[i-1]$ and $s_0[j+1]$ are symbols obtained
from substituting a variable with a string of terminal symbols. Notice again that
the substitution of a variable symbol must appear at a specific location and be
of the same length. Therefore, for an instance $s$ to be positive, the substrings
$s[i,j]$ and $s_0[i,j]$ must be the same. The latter means that $C[i,j]$ must be set to
1. Conversely, if for some $s_0[i,j]$ that corresponds to a maximal substring in $p$
consisting of only terminal symbols, the substring $s[i,j]$ of some instance $s$ does
not match $s_0[i,j]$ (i.e. $C[i,j] = 0$) then $s$ must be negative. There are at most
$V + 1$ such $C[i,j]$’s (since each such substring, except the last, must be followed
by one of the $V$ variable occurrences).

An instance $s$ is positive if and only if (1) all the occurrences of the same
variable symbol are substituted by the same strings of terminal symbols and (2)
none of the substrings in $p$ consisting of terminal symbols only are changed.
The above discussion implies that (1) and (2) can be ensured by checking that at
most $\sum_{i=1}^{k}(\alpha_i - 1) = V - k$ variables $X[i,j,l]$ and $V + 1$ variables $C[i,j]$ are
all 1s, respectively. □

Consider the pattern p = xilxzQlx 2 QQ^X\X 2 ^^Xi with l{x\) = 3,l{x2) = 
4, $l(x_3) = 2$ as an example. The proof of the above claim says that it can be
represented as the conjunction

$$(C[4,4] \wedge C[7,8] \wedge C[13,15] \wedge C[23,24]) \wedge (X[1,16,3] \wedge X[16,25,3]) \wedge X[9,19,4]$$



The variable $x_1$ must appear at positions 1, 16 and 25. Thus, the target
conjunction must contain the variables $X[1,16,3]$ and $X[16,25,3]$. The substring
$s[4,4]$ of any positive instance $s$ must always be the same as $s_0[4,4]$ and is the
string “1”. Thus, $C[4,4]$ must be present. The presence of other attributes can
be similarly explained.
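
The construction in the claim can be sketched as a single left-to-right scan of the pattern. In the sketch below, the function name `target_conjunction` and the token-list encoding are our own. The terminal blocks other than the first "1" are placeholder symbols chosen for illustration; only their lengths (2, 3 and 2) affect the attribute indices. Each terminal token is assumed to be a maximal block of terminals.

```python
from typing import Dict, List, Tuple


def target_conjunction(pattern: List[str], lengths: Dict[str, int]) -> Tuple[list, list]:
    """Tokens that are keys of `lengths` are variables; all other tokens are
    maximal terminal strings. Returns the C and X attributes of the conjunction."""
    pos = 1                                   # 1-indexed position in a positive instance
    occurrences: Dict[str, List[int]] = {}
    c_attrs = []
    for tok in pattern:
        if tok in lengths:
            occurrences.setdefault(tok, []).append(pos)
            pos += lengths[tok]
        else:
            c_attrs.append(("C", pos, pos + len(tok) - 1))
            pos += len(tok)
    x_attrs = []
    for var, starts in occurrences.items():
        # chain consecutive occurrences of the same variable
        for a, b in zip(starts, starts[1:]):
            x_attrs.append(("X", a, b, lengths[var]))
    return c_attrs, x_attrs


# Same shape as the example above; terminal symbols after the first are assumed.
pattern = ["x1", "1", "x3", "01", "x2", "001", "x1", "x2", "11", "x1"]
lengths = {"x1": 3, "x2": 4, "x3": 2}
c_attrs, x_attrs = target_conjunction(pattern, lengths)
```

For this input the scan yields the four $C$ attributes and three $X$ attributes of the conjunction displayed above, seven attributes in total, within the bound $2V - k + 1 = 10$ for $V = 6$ and $k = 3$.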

From the above claim we know that there are at most $2V - k + 1$ relevant
attributes. Combined with the fact that there are $O(n^3)$ boolean attributes,
we obtain the desired mistake bound of Lemma 2 by applying Winnow. (Since
Winnow learns a disjunction of boolean attributes, we apply Winnow to learn
the negation of the target concept which is represented as a disjunction of the
negations of the attributes.) A straightforward implementation of the above idea
would have time complexity of $O(n^4)$ per trial. To reduce the time complexity,
for each distinct pair of $i$ and $j$, $1 \le i < j \le n$, the learner first finds the
longest common prefix of the suffix of $s$ that begins at position $i$ and the suffix
that begins at position $j$. Say this common prefix is of length $l'$. Then the
learner sets all variables $X[i,j,l]$, $l \le l'$, to 1 and $X[i,j,l]$, $l > l'$, to 0. The
$C[i,j]$’s can be evaluated in a similar way. This implementation reduces the time
complexity to $O(n^3)$ per trial. This completes the proof of Lemma 2. □
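
One way to realize this speed-up is a standard dynamic program for the common-prefix lengths of all pairs of suffixes. The sketch below is an assumption about how the step might be implemented, not the authors' code; the names `lcp_table` and `x_attributes` are ours.

```python
from typing import Dict, List


def lcp_table(s: str) -> List[List[int]]:
    """lcp[i][j] = length of the longest common prefix of s[i:] and s[j:] (0-indexed, i < j)."""
    n = len(s)
    lcp = [[0] * (n + 2) for _ in range(n + 2)]
    for i in range(n - 1, -1, -1):
        for j in range(n - 1, i, -1):
            if s[i] == s[j]:
                lcp[i][j] = lcp[i + 1][j + 1] + 1
    return lcp


def x_attributes(s: str) -> Dict[tuple, int]:
    """All X[i,j,l] (1-indexed keys), each filled in constant time from the table."""
    n = len(s)
    lcp = lcp_table(s)
    X: Dict[tuple, int] = {}
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            longest = lcp[i - 1][j - 1]       # the length l' of the text
            for l in range(1, n - j + 2):
                X[(i, j, l)] = int(l <= longest)
    return X
```

The table takes quadratic time to build, so the cubic number of attributes dominates the running time.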

We now extend this result for the case of a union of a constant number of 
patterns with fixed length substitutions under attribute noise. 

Theorem 3. Suppose the target concept is a union of pattern languages $\mathcal{L}(p_1), \ldots,
\mathcal{L}(p_L)$ with fixed length substitutions. Further, suppose that for $1 \le i \le L$,
$p_i$ has $k_i$ variables and $V_i$ total occurrences of variable symbols. Then the target
concept can be efficiently learned in the mistake bound model. The number of
mistakes made after $T$ trials is bounded by

$$O\left(\left(\prod_{i=1}^{L}(2V_i - k_i + 1)\right)\left(\sum_{i=1}^{L}\log n_i\right) + \sum_{j=1}^{T}\prod_{i=1}^{L}\min(2A_j,\, 2V_i - k_i + 1)\right)$$

in the worst case. The time complexity per trial is $O((n_1 \cdots n_L)^3)$. We assume
that initially the learner is given a noise-free positive example, of length $n_i$, for
each pattern $p_i$. Here, $A_j$ is the number of attribute errors in the $j$th trial. (For
this bound to be meaningful, we assume $A_j$ is zero in most of the trials.)

Proof. First we consider the case where the target is a union of $L$ patterns
satisfying the condition of Theorem 3 but with no attribute noise. In this case,
each pattern $p_i$ can be represented as a conjunction $C_i$ of $2V_i - k_i + 1$ attributes.
The target is a disjunction $f$ of the $C_i$’s. Thus, its complement can be represented
as a $\prod_{i=1}^{L}(2V_i - k_i + 1)$-term DNF which we denote by $f'$. Each term in $f'$ must
contain exactly one literal from the set of transformed attributes corresponding
to each pattern $p_i$, $i = 1, \ldots, L$. Since there are at most $O(n_i^3)$ attributes for each
pattern $p_i$, there are at most $O((n_1 \cdots n_L)^3)$ possible terms to consider. Each such
candidate term can be treated as a new attribute. Applying Winnow would then
give us the desired mistake bound. Further, as before, the transformed attributes
corresponding to the pattern $p_i$ can be computed in $O(n_i^3)$ time. Thus, the time
complexity to update the $O((n_1 \cdots n_L)^3)$ attributes is $O((n_1 \cdots n_L)^3)$.

Finally, we introduce attribute errors. Suppose $A_j$ symbol errors occur at
trial $j$. Each symbol error can result in at most two relevant attributes of $C_i$
being complemented. There are at most $2V_i - k_i + 1$ literals in $C_i$. Thus, at most
$\min(2A_j, 2V_i - k_i + 1)$ of the attributes in $C_i$ are complemented. This implies
that at most $\prod_{i=1}^{L}\min(2A_j, 2V_i - k_i + 1)$ attributes in $f'$ are complemented. This
gives us the second term in the mistake bound that is due to attribute errors. □
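
The step from a union of conjunctions to a disjunction over new attributes is an application of De Morgan's laws, which the following sketch spells out. The function names and the representation of a conjunction as a list of attribute names are illustrative assumptions.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple


def complement_terms(conjunctions: Sequence[Sequence[str]]) -> List[Tuple[str, ...]]:
    """Terms of the DNF f' for the complement of C_1 v ... v C_L:
    each term picks one attribute (to be negated) from every C_i."""
    return list(product(*conjunctions))


def term_attribute(term: Tuple[str, ...], assignment: Dict[str, int]) -> int:
    # The new attribute for a term is 1 iff every attribute chosen in it is 0.
    return int(all(assignment[a] == 0 for a in term))


def in_union(conjunctions: Sequence[Sequence[str]], assignment: Dict[str, int]) -> bool:
    # The target f: at least one conjunction has all of its attributes set to 1.
    return any(all(assignment[a] for a in c) for c in conjunctions)
```

An instance lies outside the union exactly when at least one of the new term attributes is 1, so the complement is a disjunction over these attributes, which is the form Winnow learns.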
The next theorem suggests that it appears necessary to bound the number
of patterns in the target for it to be efficiently learnable.

Theorem 4. In the mistake bound model, if unions of an arbitrary number of pattern
languages with the fixed length substitution restriction can be learned efficiently,
then DNFs can be learned efficiently.

Proof. Suppose the learner is asked to learn a DNF $f$ in the mistake bound
model. Without loss of generality, we can assume $f$ is monotone and there are $n$
variables $x_1, \ldots, x_n$. Let $\{0, 1\}$ and $\{a_1, \ldots, a_n\}$ be sets of terminal and variable
symbols, respectively. Each term $t$ in $f$ can be represented as a pattern $p(t)$ with
$n$ characters. The $i$th character is set to 1 if the literal $x_i$ is in term $t$ and $a_i$
otherwise. We represent an instance $x$ as an $n$-bit vector (string). If we restrict
$l(a_i) = 1$ for all $a_i$’s then clearly, $t(x) = 1$ iff $x \in \mathcal{L}(p(t))$. This is a polynomial-time
prediction preserving reduction which completes the proof. □
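
The reduction is simple enough to check by brute force on small cases. The sketch below uses our own encoding (a monotone term as a set of 1-based variable indices, a pattern as a list whose entries are either the terminal "1" or a variable tag) and is meant only as an illustration.

```python
from typing import List, Set, Union

Symbol = Union[str, tuple]


def term_to_pattern(term: Set[int], n: int) -> List[Symbol]:
    """p(t): position i holds the terminal '1' if x_i is in t, else the variable a_i."""
    return ["1" if i in term else ("a", i) for i in range(1, n + 1)]


def matches(pattern: List[Symbol], x: str) -> bool:
    """Membership of the bit string x in L(p(t)) when every l(a_i) = 1.
    Each variable occurs once, so only the terminal positions constrain x."""
    return len(x) == len(pattern) and all(
        p != "1" or c == "1" for p, c in zip(pattern, x)
    )


def eval_term(term: Set[int], x: str) -> bool:
    # t(x) = 1 iff every variable of the monotone term is set to 1 in x
    return all(x[i - 1] == "1" for i in term)
```

A DNF then evaluates to 1 on $x$ exactly when $x$ matches the pattern of at least one of its terms.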

5 Learning Unions of One-Variable Pattern Languages

We now consider the case of one-variable patterns without the fixed length substitution
requirement. As in the last section, we first prove our result for the case
where the target is a single pattern with no attribute noise. Then, we apply the 
same technique to generalize this result to unions of patterns in the presence of 
attribute noise. 

Lemma 5. Suppose the target concept is a one-variable pattern $p$ with $V$ occurrences
of the variable symbol. Then the target concept can be efficiently learned
in the mistake bound model with $O(V \log n)$ mistakes in the worst case. The time
complexity per trial is $O(n^4 m V) = O(n^5 m)$ where $n$ is the length of the first
(positive) counterexample and $m$ is the length of the example to be classified.

Proof. The learner guesses negative until obtaining a positive counterexample
$s_0$. Denote the length of $s_0$ by $n$, and the starting position^ of the $i$th (counting
from the leftmost end of the pattern) substitution of the variable $x$ by $\alpha_i$. For
a moment, we assume the learner is told the number of occurrences $V$ of the
variable symbol in the target, and the length $\ell$ of the substituted terminal string.

Suppose the learner is asked to classify a given unlabeled instance $s$ of length
$m$. If the difference in length of $s_0$ and $s$ is not divisible by $V$ then we can
conclude immediately that $s$ must be classified negative. Henceforth, we assume
the difference between the lengths of $s_0$ and $s$ is divisible by $V$. If $s$ is positive
then the substitution of $x$ in $s$ has length $\ell' = \ell + \frac{m-n}{V}$. The $i$th substitution
of $x$ in $s$ must begin at location $\alpha'_i = \alpha_i + (i-1)\frac{m-n}{V}$ and the substitution
for the variable $x$ is the substring $s[\alpha'_i, \alpha'_i + \ell' - 1]$. In other words, to see if all
substitutions of $x$ in $s$ are the same, we simply check for all $i = 2, \ldots, V$, whether
$s[\alpha'_{i-1}, \alpha'_{i-1} + \ell' - 1] = s[\alpha'_i, \alpha'_i + \ell' - 1]$. If this is not so then we can immediately
conclude that $s \notin \mathcal{L}(p)$.

Unfortunately, we do not know the $\alpha_i$’s. To circumvent this problem, we
introduce new attributes $X[\beta,\gamma,i]$, $1 \le \beta < \gamma \le n$, $2 \le i \le V$, such that $X[\beta,\gamma,i]$
is set to true if and only if the substring $s[\beta + (i-2)\frac{m-n}{V}, \beta + (i-2)\frac{m-n}{V} + \ell' - 1]$
is the same as $s[\gamma + (i-1)\frac{m-n}{V}, \gamma + (i-1)\frac{m-n}{V} + \ell' - 1]$. Clearly, if all substitutions of $x$ in
$s$ are the same then the $(V-1)$-conjunction

$$C_x = X[\alpha_1, \alpha_2, 2] \wedge \ldots \wedge X[\alpha_{i-1}, \alpha_i, i] \wedge \ldots \wedge X[\alpha_{V-1}, \alpha_V, V]$$

is satisfied, and vice versa.

^ These positions are not known to the learner.

To classify an instance correctly as positive, we also need to ensure that the
terminal symbols in $p$ remain the same. Let $\alpha_0 = -\ell$ and $\alpha_{V+1} = n$. Then
clearly, the substring of terminal symbols between the $i$th variable symbol
and the $(i+1)$st variable symbol is the string $s_0[\alpha_i + \ell, \alpha_{i+1} - 1]$ (which is defined to
be the empty string if $\alpha_i + \ell > \alpha_{i+1} - 1$). If none of the terminal symbols in this
maximal substring of terminal symbols is changed in $s$ then it must appear in
$s[\alpha'_i + \ell', \alpha'_{i+1} - 1]$. In other words, to check if none of the terminal symbols in
the target has been replaced, it is sufficient and necessary to verify that

$$s[\alpha'_i + \ell', \alpha'_{i+1} - 1] = s_0[\alpha_i + \ell, \alpha_{i+1} - 1] \quad \forall i = 0, \ldots, V \qquad (1)$$

As before, since we do not know where the $\alpha_i$’s are, we introduce new attributes
$C[i,B,E]$, $0 \le i \le V$, $1 \le B \le E \le n$. We set $C[i,B,E]$ to 1 when
$s[B + i\frac{m-n}{V}, E + i\frac{m-n}{V}] = s_0[B,E]$. It is easy to verify that saying Equation (1) is
satisfied is the same as saying the conjunction $C_t$ (shown below) is satisfied.

$$C_t = \bigwedge_{i=0}^{V} C[i, \alpha_i + \ell, \alpha_{i+1} - 1]$$

Therefore, the target pattern $p$ can be represented as a conjunction $C_t \wedge C_x$
of $2V$ boolean attributes. There are $O(n^2 V)$ possible attributes to consider. Thus
running Winnow to learn $C_t \wedge C_x$ guarantees that at most $O(2V(\log n + \log V)) =
O(2V \log n)$ mistakes are made (since $V \le n$).
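
To make the two checks concrete, the sketch below evaluates $C_x$ and $C_t$ for the (normally unknown) true positions, using the shift $\frac{m-n}{V}$. It is a hypothetical helper for illustration: the function name is ours, and positions are 0-indexed Python offsets, under which the conventions $\alpha_0 = -\ell$ and $\alpha_{V+1} = n$ apply directly.

```python
from typing import List


def classify_one_variable(s: str, s0: str, starts: List[int], ell: int) -> bool:
    """Decide whether s is generated by the one-variable pattern underlying s0.

    starts: 0-indexed start positions of the V substitutions in s0
    ell:    common length of those substitutions in s0
    """
    n, m, V = len(s0), len(s), len(starts)
    if V == 0:
        return s == s0                       # no variable: the pattern is s0 itself
    if (m - n) % V != 0:
        return False                         # length difference must be divisible by V
    delta = (m - n) // V
    ell_new = ell + delta                    # the length l' of each substitution in s
    if ell_new < 1:
        return False                         # substitutions are non-empty
    alpha = [-ell] + list(starts) + [n]      # alpha_0, alpha_1..alpha_V, alpha_{V+1}
    shifted = [alpha[i] + (i - 1) * delta for i in range(V + 2)]   # alpha'_i
    # C_x: consecutive substitutions of the variable must coincide
    for i in range(2, V + 1):
        prev = s[shifted[i - 1]:shifted[i - 1] + ell_new]
        cur = s[shifted[i]:shifted[i] + ell_new]
        if prev != cur:
            return False
    # C_t: every maximal block of terminals must be unchanged
    for i in range(0, V + 1):
        if s[shifted[i] + ell_new:shifted[i + 1]] != s0[alpha[i] + ell:alpha[i + 1]]:
            return False
    return True
```

The learner cannot call such a function, since it does not know `starts`; the attributes $X[\beta,\gamma,i]$ and $C[i,B,E]$ precompute the same comparisons for every candidate position so that Winnow can discover which ones matter.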

The question remains of guessing $\ell$ and $V$ correctly. There are only $O(n^2)$
such guesses. We can run one copy of the above algorithm for each guess and
run the weighted majority algorithm [30] on these algorithms. The mistake bound is
$O(\log(n^2) + 2V \log n) = O(V \log n)$ with running time $O(n^4 m V) = O(n^5 m)$. □
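
The combination step can be sketched as follows. This is a minimal deterministic weighted majority vote; the class name, the tie-breaking rule and the penalty factor $\beta = 1/2$ are assumptions made for illustration. Each expert would be one copy of the Winnow-based learner run under a particular guess of $\ell$ and $V$.

```python
from typing import List, Sequence


class WeightedMajority:
    """Weighted vote over a pool of experts; wrong experts are penalized by beta."""

    def __init__(self, n_experts: int, beta: float = 0.5):
        self.weights: List[float] = [1.0] * n_experts
        self.beta = beta

    def predict(self, votes: Sequence[bool]) -> bool:
        yes = sum(w for w, v in zip(self.weights, votes) if v)
        no = sum(w for w, v in zip(self.weights, votes) if not v)
        return yes >= no                     # ties are broken towards "positive"

    def update(self, votes: Sequence[bool], label: bool) -> None:
        # Multiply the weight of every expert that voted incorrectly by beta.
        for i, v in enumerate(votes):
            if bool(v) != label:
                self.weights[i] *= self.beta
```

The experts with incorrect guesses lose weight each time they err, so the vote is eventually dominated by the copy running with the correct guess.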

Lemma 5 can be extended to learn unions of one-variable pattern languages
in the presence of attribute noise (except for the first counterexample which
must be noise free). The bound obtained is shown in the next theorem.

Theorem 6. Suppose the target concept $p$ is a union of one-variable pattern
languages $\mathcal{L}(p_1), \ldots, \mathcal{L}(p_L)$. Further, suppose the number of occurrences of the
variable symbol in $p_i$ is $V_i$. Then $p$ can be efficiently learned in the mistake bound
model. The number of mistakes made after $T$ trials is bounded by

$$O\left(\left(\prod_{i=1}^{L} 2V_i\right)\left(\sum_{i=1}^{L}\log n_i\right) + \sum_{j=1}^{T}\prod_{i=1}^{L}\min(2A_j,\, 2V_i)\right)$$

in the worst case. The time complexity per trial is $O\left(m (n_1 \cdots n_L)^4 \prod_{i=1}^{L} V_i\right) =
O\left(m (n_1 \cdots n_L)^5\right)$. Here,

- We assume that initially the learner is given a noise-free positive example,
of length $n_i$, for each pattern $p_i$.

- $m$ is the length of the unlabeled example to be classified in the $(T+1)$st trial.

- $A_j$ is the number of attribute errors in the $j$th trial. (For this bound to be
meaningful, we assume that $A_j$ is zero in most of the trials.)

Proof. (Sketch) We obtain this result by extending Lemma 5 to unions of languages
with attribute noise using the same technique as that used in extending
Lemma 2 to Theorem 3. □

6 Learning Ordered Forests 

We have demonstrated how the problem of learning unions of pattern languages 
can be reduced to learning conjunctions of boolean attributes. Next, we apply 
this idea to learning ordered forests with a bounded number of trees. No restrictions
are needed on the number of children per node or the alphabet size for the
terminal symbols.

Theorem 7. Ordered forests composed of tree patterns $p_1, \ldots, p_L$ can be efficiently
learned in the on-line model. The number of mistakes made after $T$ trials
is bounded by

$$O\left(\left(\prod_{i=1}^{L}|p_i|\right)\left(\sum_{i=1}^{L}\log n_i\right) + \sum_{j=1}^{T}\prod_{i=1}^{L}\min(2A_j,\, |p_i|)\right)$$

in the worst case. The time complexity per trial is O ((ni...riL)^) . Here, 

- We assume that initially the learner is given a noise-free positive example,
of length $n_i$, for each tree pattern $p_i$.

- $A_j$ is the number of attribute errors in the $j$th trial. (For this bound to be
meaningful, we assume that $A_j$ is zero in most of the trials.) □

Proof. (Sketch) We present only the proof for the case of learning a single ordered
tree pattern. The extension of the proof to the case of learning ordered forests
in the presence of attribute errors is like that used to prove Theorem 3.

Suppose $t$ is a tree and $u$ is a node in $t$. Let $\mathrm{path}_t(u)$ denote the labeled path
obtained by traversing from the root of $t$ to $u$. Given two distinct trees $t$ and
$t'$, we say $\mathrm{path}_t(u) = \mathrm{path}_{t'}(u')$ if and only if the sequences of the node labels
(except for the last) and the branches taken as we traverse from the root of
$t$ to $u$ and from the root of $t'$ to $u'$ are the same. As before, we simply keep
predicting negative until we get a positive counterexample $t_0$. Let $n$ denote the
number of nodes in $t_0$.

To make a prediction on an instance $t$, we transform $t$ to a new instance with
the following set of $O(n^2)$ attributes (see Figure 1 for an illustration).

- For each vertex $u_0$ in $t_0$, we introduce a new attribute $C[u_0]$. This attribute
is set to 1 if and only if $\mathrm{path}_{t_0}(u_0) = \mathrm{path}_t(u)$ for some node $u$ in $t$, and the
labels of the nodes $u$ in $t$ and $u_0$ in $t_0$ are the same.

- For each distinct pair of nodes $u_0$ and $v_0$ in $t_0$, we introduce a new attribute
$X[u_0, v_0]$. $X[u_0, v_0]$ is set to 1 if and only if there are two distinct nodes $u$
and $v$ in $t$ that satisfy:

1. $\mathrm{path}_{t_0}(u_0) = \mathrm{path}_t(u)$ and $\mathrm{path}_{t_0}(v_0) = \mathrm{path}_t(v)$.

2. The two subtrees in $t$ that are rooted at $u$ and $v$ are identical. (Since the
siblings are distinguishable, we can check that the subtrees are identical
in linear time.)

Fig. 1. The figure on the left shows a tree pattern $p$. The figure on the right is
a (positive) tree instance $t_0$ that can be generated by $p$. If $t_0$ is the first counterexample
obtained then the conjunctive representation of $p$ is $C[1] \wedge C[2] \wedge C[3] \wedge C[6] \wedge X[4,7]$.

Let $t'$ be the new instance with $n + \binom{n}{2}$ boolean attributes obtained by the above
transformation.

Claim. The target tree pattern $p$ can be represented as a conjunction $f$ of at most
$|p|$ of the new boolean attributes such that given an instance $t$, the transformed
instance $t'$ is classified positive by $f$ iff $t$ is classified as positive by $p$.

Proof. To verify an instance $t$ is in $\mathcal{L}(p)$, it is necessary and sufficient to ensure
the following two conditions are satisfied.

1. For each node $u$ in $p$ that is labeled by a terminal symbol, there is a corresponding
node $u'$ in $t$ such that $\mathrm{path}_p(u) = \mathrm{path}_t(u')$ and both $u$ and $u'$ have
the same terminal label.

2. For each pair of distinct leaves $u$ and $v$ in $p$ labeled by the same variable, there
are two nodes $u'$ and $v'$ in $t$ such that $\mathrm{path}_p(u) = \mathrm{path}_t(u')$ and $\mathrm{path}_p(v) =
\mathrm{path}_t(v')$. Furthermore, the subtrees in $t$ rooted at $u'$ and $v'$ are identical. That
is, the substitutions in $t$ for $u$ and $v$ are the same.

Clearly, $\mathrm{path}_p(u) = \mathrm{path}_t(u')$ is equivalent to $\mathrm{path}_{t_0}(u_0) = \mathrm{path}_t(u')$ for the
node $u_0$ in $t_0$ that corresponds to $u$ in $p$. Condition 1 is satisfied if and only if
for each node $u_0$ in $t_0$ that corresponds to a node in $p$ labeled using a terminal
symbol, the attribute $C[u_0]$ is set to 1. To ensure Condition 2 is satisfied, it is
sufficient to check that $X[u_0, v_0] = 1$ for each pair of distinct nodes $u_0$ and $v_0$ in
$t_0$ that corresponds to some pair of distinct leaves in $p$ that are labeled by the
same variable symbol. Suppose the leaves in $t_0$ that correspond to substituting
a variable symbol $x_i$ in $p$ are $l_1, \ldots, l_k$. Then it suffices to check that $X[l_1, l_2] =
X[l_2, l_3] = \cdots = X[l_{k-1}, l_k] = 1$. Therefore, the target concept can be represented
as a conjunction of at most $|p|$ of the transformed attributes. This completes
the proof of the claim. □

Combining the above claim with Theorem 1 and the technique used in
Theorem 3 completes the proof of Theorem 7. □
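
The tree attributes used in this section can be sketched as follows. The encoding is our own assumption: a tree is a pair `(label, children)` with `children` a tuple of subtrees, and a node is identified by its path, that is, the sequence of (ancestor label, branch index) pairs leading to it, which excludes the node's own label as in the definition of $\mathrm{path}_t(u)$.

```python
from typing import Dict, Tuple

Tree = Tuple[str, tuple]          # (label, (child, child, ...))
Path = Tuple[Tuple[str, int], ...]


def index_by_path(tree: Tree) -> Dict[Path, Tree]:
    """Map the path of every node to the subtree rooted at that node."""
    table: Dict[Path, Tree] = {}

    def walk(node: Tree, path: Path) -> None:
        table[path] = node
        label, children = node
        for k, child in enumerate(children):
            walk(child, path + ((label, k),))

    walk(tree, ())
    return table


def tree_attributes(t: Tree, t0: Tree):
    """Return (C, X) for instance t relative to the first positive example t0."""
    nodes0, nodes = index_by_path(t0), index_by_path(t)
    # C[u0] = 1 iff t has a node with the same path and the same label as u0
    C = {p0: int(p0 in nodes and nodes[p0][0] == sub0[0]) for p0, sub0 in nodes0.items()}
    X = {}
    paths = list(nodes0)
    for a in range(len(paths)):
        for b in range(a + 1, len(paths)):
            u0, v0 = paths[a], paths[b]
            # X[u0,v0] = 1 iff both paths exist in t and lead to identical subtrees
            X[(u0, v0)] = int(u0 in nodes and v0 in nodes and nodes[u0] == nodes[v0])
    return C, X
```

Comparing subtrees with `==` on nested tuples is a convenience of this encoding; it plays the role of the linear-time subtree identity check mentioned in the text.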

Amoth, Cull and Tadepalli [2] have shown that DNF and the class of ordered
forests with bounded^ label alphabet size and bounded number of children per
node are equivalent. Hence, it seems unlikely that unions of an arbitrary number of
tree patterns can be learned in the mistake bound model.

7 Conclusion 

In this paper, we demonstrated how learning unions of pattern languages and
pattern-related concepts can be reduced to learning disjunctions of boolean attributes.
In particular, we presented efficient on-line algorithms for learning
unions of a constant number of tree patterns, unions of a constant number of
one-variable pattern languages, and unions of a constant number of pattern languages
with fixed length substitutions. All of our algorithms are robust against
attribute noise and can be modified to handle concept drift. Further, our mistake
bounds only have a logarithmic dependence on the length of the examples. The
requirement that the learner be given a noise-free example for each pattern can
be removed by sampling as discussed in Section 2.

There are several interesting future directions suggested by this work. As we 
have discussed, we could generalize the class of pattern languages by assigning 
a penalty (i.e. weight) to each violation of the rule that a terminal symbol 
cannot be changed. The weights can be different for different terminal symbols. 
Similarly, we can also assign a penalty to each violation of the rule that a pair of 
variable symbols, of the same variable, must be substituted by the same terminal 
string. If the penalty incurred by an instance for violating these rules is below 
a given tolerable threshold then it is in the target concept $\mathcal{L}'(p)$ generated by
$p$. If the penalty is above the threshold then it is not in $\mathcal{L}'(p)$. It would be very
interesting to explore applications for this extension and compare our approach 
to those currently in use. 

^ They showed that ordered forests can be learned using subset queries and equivalence 
queries. Further, if the alphabet size or number of children per node is unbounded, 
then subset queries can be simulated using membership queries by using a unique 
label or a subtree to stand for each variable.

In this paper, we solved one of the open problems suggested by Reischuk
and Zeugmann m- Namely, we gave an efficient algorithm to learn unions of a 
constant number of one-variable pattern languages. We were also able to learn
unions of a constant number of pattern languages (with no restriction on the number
of variables) when we restricted the substitutions to fixed length substitutions. A
challenging open problem from Reischuk and Zeugmann that we did not resolve 
here is learning the class of 2-variable pattern languages (in the mistake bound 
model). While additional tools will be needed to solve this problem, we feel that
the technique proposed here may be applicable to this problem.